Computation and Language 90
☆ A Grounded Typology of Word Classes
We propose a grounded approach to meaning in language typology. We treat data
from perceptual modalities, such as images, as a language-agnostic
representation of meaning. Hence, we can quantify the function-form
relationship between images and captions across languages. Inspired by
information theory, we define "groundedness", an empirical measure of
contextual semantic contentfulness (formulated as a difference in surprisal)
which can be computed with multilingual multimodal language models. As a proof
of concept, we apply this measure to the typology of word classes. Our measure
captures the contentfulness asymmetry between functional (grammatical) and
lexical (content) classes across languages, but contradicts the view that
functional classes do not convey content. Moreover, we find universal trends in
the hierarchy of groundedness (e.g., nouns > adjectives > verbs), and show that
our measure partly correlates with psycholinguistic concreteness norms in
English. We release a dataset of groundedness scores for 30 languages. Our
results suggest that the grounded typology approach can provide quantitative
evidence about semantic function in language.
comment: 19 pages, 5 figures
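The surprisal-difference idea behind the groundedness measure can be illustrated with a minimal sketch (the probabilities below are invented for illustration; the paper's exact formulation over multimodal language models may differ):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(p)

def groundedness(p_with_image: float, p_text_only: float) -> float:
    """Groundedness as a difference in surprisal: how much a word's
    surprisal drops once the perceptual (image) context is available."""
    return surprisal(p_text_only) - surprisal(p_with_image)

# A content word ("dog") is strongly predicted by the image; a function
# word ("the") is barely affected by it:
g_content = groundedness(p_with_image=0.40, p_text_only=0.05)
g_function = groundedness(p_with_image=0.90, p_text_only=0.85)
```

Under this toy setup the content word gains about 3 bits of groundedness while the function word gains close to zero, mirroring the asymmetry between lexical and functional classes described above.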
☆ AdvPrefix: An Objective for Nuanced LLM Jailbreaks
Many jailbreak attacks on large language models (LLMs) rely on a common
objective: making the model respond with the prefix "Sure, here is (harmful
request)". While straightforward, this objective has two limitations: limited
control over model behaviors, often resulting in incomplete or unrealistic
responses, and a rigid format that hinders optimization. To address these
limitations, we introduce AdvPrefix, a new prefix-forcing objective that
enables more nuanced control over model behavior while being easy to optimize.
Our objective leverages model-dependent prefixes, automatically selected based
on two criteria: high prefilling attack success rates and low negative
log-likelihood. It can further simplify optimization by using multiple prefixes
for a single user request. AdvPrefix can integrate seamlessly into existing
jailbreak attacks to improve their performance for free. For example, simply
replacing GCG attack's target prefixes with ours on Llama-3 improves nuanced
attack success rates from 14% to 80%, suggesting that current alignment
struggles to generalize to unseen prefixes. Our work demonstrates the
importance of jailbreak objectives in achieving nuanced jailbreaks.
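The two selection criteria can be sketched as a simple ranking over candidate prefixes (a hypothetical sketch; the field names and the combined scoring rule are illustrative, not the paper's exact procedure):

```python
def select_prefixes(candidates: list, k: int = 2) -> list:
    """Rank candidate prefixes by high prefilling attack success rate
    (ASR) first, breaking ties by low negative log-likelihood (NLL),
    and keep the top k for a single user request."""
    ranked = sorted(candidates, key=lambda c: (-c["asr"], c["nll"]))
    return [c["prefix"] for c in ranked[:k]]

candidates = [
    {"prefix": "Sure, here is",            "asr": 0.14, "nll": 1.2},
    {"prefix": "Here's a detailed guide:", "asr": 0.80, "nll": 0.9},
    {"prefix": "Of course. Step 1:",       "asr": 0.80, "nll": 1.5},
]
chosen = select_prefixes(candidates)
```

Keeping several model-dependent prefixes per request, rather than one rigid "Sure, here is..." target, is what relaxes the optimization objective described above.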
☆ SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
Long-context LLMs have enabled numerous downstream applications but also
introduced significant challenges related to computational and memory
efficiency. To address these challenges, optimizations for long-context
inference have been developed, centered around the KV cache. However, existing
benchmarks often evaluate only single-request scenarios, neglecting the full lifecycle of
the KV cache in real-world use. This oversight is particularly critical, as KV
cache reuse has become widely adopted in LLM inference frameworks, such as
vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft,
Google, and Anthropic. To address this gap, we introduce
SCBench (SharedContextBench), a comprehensive benchmark for evaluating
long-context methods from a KV cache-centric perspective: 1) KV cache
generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache
loading. Specifically, SCBench uses test examples with shared context, spanning
12 tasks with two shared context modes, covering four categories of
long-context capabilities: string retrieval, semantic retrieval, global
information, and multi-task. With it, we provide an extensive KV cache-centric
analysis of eight categories of long-context solutions, including Gated Linear
RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention,
KV cache dropping, quantization, retrieval, loading, and prompt compression.
The evaluation is conducted on 8 long-context LLMs. Our findings show that
sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding
with O(n) memory and sub-O(n^2) pre-filling computation performs robustly.
Dynamic sparsity yields more expressive KV caches than static patterns, and
layer-level sparsity in hybrid architectures reduces memory usage with strong
performance. Additionally, we identify attention distribution shift issues in
long-generation scenarios. https://aka.ms/SCBench.
☆ Interlocking-free Selective Rationalization Through Genetic-based Learning
A popular end-to-end architecture for selective rationalization is the
select-then-predict pipeline, comprising a generator to extract highlights fed
to a predictor. Such a cooperative system suffers from suboptimal equilibrium
minima due to the dominance of one of the two modules, a phenomenon known as
interlocking. While several contributions aimed at addressing interlocking,
they only mitigate its effect, often by introducing feature-based heuristics,
sampling, and ad-hoc regularizations. We present GenSPP, the first
interlocking-free architecture for selective rationalization that requires no
learning overhead of the kinds mentioned above. GenSPP avoids
interlocking by performing disjoint training of the generator and predictor via
genetic global search. Experiments on a synthetic and a real-world benchmark
show that our model outperforms several state-of-the-art competitors.
☆ DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE)
Vision-Language Models that significantly improves upon its predecessor,
DeepSeek-VL, through two major upgrades. For the vision component, we
incorporate a dynamic tiling vision encoding strategy designed for processing
high-resolution images with different aspect ratios. For the language
component, we leverage DeepSeekMoE models with the Multi-head Latent Attention
mechanism, which compresses Key-Value cache into latent vectors, to enable
efficient inference and high throughput. Trained on an improved vision-language
dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks,
including but not limited to visual question answering, optical character
recognition, document/table/chart understanding, and visual grounding. Our
model series is composed of three variants: DeepSeek-VL2-Tiny,
DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated
parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art
performance with similar or fewer activated parameters compared to existing
open-source dense and MoE-based models. Codes and pre-trained models are
publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
☆ Still "Talking About Large Language Models": Some Clarifications
My paper "Talking About Large Language Models" has more than once been
interpreted as advocating a reductionist stance towards large language models.
But the paper was not intended that way, and I do not endorse such positions.
This short note situates the paper in the context of a larger philosophical
project that is concerned with the (mis)use of words rather than metaphysics,
in the spirit of Wittgenstein's later writing.
☆ One world, one opinion? The superstar effect in LLM responses
As large language models (LLMs) are shaping the way information is shared and
accessed online, their opinions have the potential to influence a wide
audience. This study examines who the LLMs view as the most prominent figures
across various fields, using prompts in ten different languages to explore the
influence of linguistic diversity. Our findings reveal low diversity in
responses, with a small number of figures dominating recognition across
languages (also known as the "superstar effect"). These results highlight the
risk of narrowing global knowledge representation when LLMs retrieve subjective
information.
☆ Benchmarking Linguistic Diversity of Large Language Models
The development and evaluation of Large Language Models (LLMs) has primarily
focused on their task-solving capabilities, with recent models even surpassing
human performance in some areas. However, this focus often neglects whether
machine-generated language matches the human level of diversity, in terms of
vocabulary choice, syntactic construction, and expression of meaning, raising
questions about whether the fundamentals of language generation have been fully
addressed. This paper emphasizes the importance of examining the preservation
of human linguistic richness by language models, given the concerning surge in
online content produced or aided by LLMs. We propose a comprehensive framework
for evaluating LLMs from various linguistic diversity perspectives including
lexical, syntactic, and semantic dimensions. Using this framework, we benchmark
several state-of-the-art LLMs across all diversity dimensions, and conduct an
in-depth case study for syntactic diversity. Finally, we analyze how different
development and deployment choices impact the linguistic diversity of LLM
outputs.
☆ Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media
Stance detection is crucial for fostering a human-centric Web by analyzing
user-generated content to identify biases and harmful narratives that undermine
trust. With the development of Large Language Models (LLMs), existing
approaches treat stance detection as a classification problem, providing robust
methodologies for modeling complex group interactions and advancing
capabilities in natural language tasks. However, these methods often lack
interpretability, limiting their ability to offer transparent and
understandable justifications for predictions. This study adopts a generative
approach, where stance predictions include explicit, interpretable rationales,
and integrates them into smaller language models through single-task and
multitask learning. We find that incorporating reasoning into stance detection
enables the smaller model (FlanT5) to outperform GPT-3.5's zero-shot
performance, achieving an improvement of up to 9.57%. Moreover, our results
show that reasoning capabilities enhance multitask learning performance but may
reduce effectiveness in single-task settings. Crucially, we demonstrate that
faithful rationales improve rationale distillation into SLMs, advancing efforts
to build interpretable, trustworthy systems for addressing discrimination,
fostering trust, and promoting equitable engagement on social media.
☆ Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models
The sheer scale of data required to train modern large language models (LLMs)
poses significant risks, as models are likely to gain knowledge of sensitive
topics such as bio-security, as well as the ability to replicate copyrighted
works. Methods designed to remove such knowledge must do so from all prompt
directions, in a multi-lingual capacity and without degrading general model
performance. To this end, we introduce the targeted angular reversal (TARS)
method of knowledge removal from LLMs. The TARS method firstly leverages the
LLM in combination with a detailed prompt to aggregate information about a
selected concept in the internal representation space of the LLM. It then
refines this approximate concept vector to trigger the concept token with high
probability, by perturbing the approximate concept vector with noise and
transforming it into token scores with the language model head. The feedforward
weight vectors in the LLM which operate directly on the internal representation
space, and have the highest cosine similarity with this targeting vector, are
then replaced by a reversed targeting vector, thus limiting the ability of the
concept to propagate through the model. The modularity of the TARS method
allows for a sequential removal of concepts from Llama 3.1 8B, such as the
famous literary detective Sherlock Holmes, and the planet Saturn. It is
demonstrated that the probability of triggering target concepts can be reduced
to 0.00 with as few as one TARS edit, whilst simultaneously removing the
knowledge bi-directionally. Moreover, knowledge is shown to be removed across
all languages despite only being targeted in English. Importantly, TARS has
minimal impact on the general model capabilities, as after removing 5 diverse
concepts in a modular fashion, there is minimal KL divergence in the next token
probabilities of the LLM on large corpora of Wikipedia text (median of 0.002).
comment: 14 pages, 5 figures, 1 table
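The core weight edit can be sketched in a few lines (a simplified, hypothetical rendering using plain Python lists; the actual method operates on an LLM's feedforward weight vectors and a refined concept vector):

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tars_edit(weight_rows, concept, top_k=1):
    """Replace the top_k weight vectors most aligned (by cosine
    similarity) with the concept vector by its reversal (negation),
    limiting the concept's ability to propagate through the model."""
    reversed_concept = [-x for x in concept]
    ranked = sorted(range(len(weight_rows)),
                    key=lambda i: cosine(weight_rows[i], concept),
                    reverse=True)
    edited = [row[:] for row in weight_rows]
    for i in ranked[:top_k]:
        edited[i] = reversed_concept[:]
    return edited

rows = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
edited = tars_edit(rows, concept=[1.0, 0.0], top_k=1)
```

Only the row best aligned with the concept direction is overwritten; all other weights are left untouched, which is what keeps the edit's impact on general capabilities small.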
☆ Efficient Continual Pre-training of LLMs for Low-resource Languages
Open-source Large Language models (OsLLMs) propel the democratization of
natural language research by giving the flexibility to augment or update model
parameters for performance improvement. Nevertheless, like proprietary LLMs,
OsLLMs offer poorer performance on low-resource languages (LRLs) than
high-resource languages (HRLs), owing to smaller amounts of training data and
underrepresented vocabulary. On the other hand, continual pre-training (CPT)
with large amounts of language-specific data is a costly proposition in terms
of data acquisition and computational resources. Our goal is to drastically
reduce CPT cost. To that end, we first develop a new algorithm to select a
subset of texts from a larger corpus. We show the effectiveness of our
technique using very little CPT data. In search of further improvement, we
design a new algorithm to select tokens to include in the LLM vocabulary. We
experiment with the recent Llama-3 model and nine Indian languages with diverse
scripts and varying degrees of resource availability. For evaluation, we use
IndicGenBench, a generation task benchmark dataset for Indic languages. We
experiment with various CPT corpora and augmented vocabulary size and offer
insights across language families.
☆ How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives
A rapidly developing application of LLMs in XAI is to convert quantitative
explanations such as SHAP into user-friendly narratives to explain the
decisions made by smaller prediction models. Evaluating the narratives without
relying on human preference studies or surveys is becoming increasingly
important in this field. In this work we propose a framework and explore
several automated metrics to evaluate LLM-generated narratives for explanations
of tabular classification tasks. We apply our approach to compare several
state-of-the-art LLMs across different datasets and prompt types. As a
demonstration of their utility, these metrics allow us to identify new
challenges related to LLM hallucinations for XAI narratives.
☆ Retrieval-Augmented Semantic Parsing: Using Large Language Models to Improve Generalization
Open-domain semantic parsing remains a challenging task, as models often rely
on heuristics and struggle to handle unseen concepts. In this paper, we
investigate the potential of large language models (LLMs) for this task and
introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective
approach that integrates external lexical knowledge into the parsing process.
Our experiments not only show that LLMs outperform previous encoder-decoder
baselines for semantic parsing, but that RASP further enhances their ability to
predict unseen concepts, nearly doubling the performance of previous models on
out-of-distribution concepts. These findings highlight the promise of
leveraging large language models and retrieval mechanisms for robust and
open-domain semantic parsing.
comment: Submitted to ARR
☆ VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation COLING 2025
Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim
We propose the VLR-Bench, a visual question answering (VQA) benchmark for
evaluating vision language models (VLMs) based on retrieval augmented
generation (RAG). Unlike existing evaluation datasets for external
knowledge-based VQA, the proposed VLR-Bench includes five input passages. This
allows testing of the ability to determine which passage is useful for
answering a given query, a capability lacking in previous research. In this
context, we constructed a dataset of 32,000 automatically generated
instruction-following examples, which we denote as VLR-IF. This dataset is
specifically designed to enhance the RAG capabilities of VLMs by enabling them
to learn how to generate appropriate answers based on input passages. We
evaluated the validity of the proposed benchmark and training data and verified
its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3
model. The proposed VLR-Bench and VLR-IF datasets are publicly available
online.
comment: The 31st International Conference on Computational Linguistics
(COLING 2025), 19 pages
☆ TACOMORE: Leveraging the Potential of LLMs in Corpus-based Discourse Analysis with Prompt Engineering
The capacity of LLMs to carry out automated qualitative analysis has been
questioned by corpus linguists, and it has been argued that corpus-based
discourse analysis incorporating LLMs is hindered by issues of unsatisfying
performance, hallucination, and irreproducibility. Our proposed method,
TACOMORE, aims to address these concerns by serving as an effective prompting
framework in this domain. The framework consists of four principles, i.e.,
Task, Context, Model and Reproducibility, and specifies five fundamental
elements of a good prompt, i.e., Role Description, Task Definition, Task
Procedures, Contextual Information and Output Format. We conduct experiments on
three LLMs, i.e., GPT-4o, Gemini-1.5-Pro and Gemini-1.5-Flash, and find that
TACOMORE helps improve LLM performance in three representative discourse
analysis tasks, i.e., the analysis of keywords, collocates and concordances,
based on an open corpus of COVID-19 research articles. Our findings show the
efficacy of the proposed prompting framework TACOMORE in corpus-based discourse
analysis in terms of Accuracy, Ethicality, Reasoning, and Reproducibility, and
provide novel insights into the application and evaluation of LLMs in automated
qualitative studies.
☆ ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL
Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by
large language models (LLMs), the latest state-of-the-art techniques are still
trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which
limits their applicability in open scenarios. To address this challenge, we
propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to
improve the comprehensive capabilities of open-source LLMs for Text2SQL,
thereby providing a more practical solution. Our approach begins with
multi-task supervised fine-tuning (SFT) using various synthetic training data
related to SQL generation. Unlike existing SFT-based Text2SQL methods, we
introduce several additional SFT tasks, including schema linking, noise
correction, and continuation writing. Engaging in a variety of SQL generation
tasks enhances the model's understanding of SQL syntax and improves its ability
to generate high-quality SQL queries. Additionally, inspired by the
collaborative modes of LLM agents, we introduce a Multitask Collaboration
Prompting (MCP) strategy. This strategy leverages collaboration across several
SQL-related tasks to reduce hallucinations during SQL generation, thereby
maximizing the potential of enhancing Text2SQL performance through explicit
multitask capabilities. Extensive experiments and in-depth analyses have been
performed on eight open-source LLMs and five widely-used benchmarks. The
results demonstrate that our proposal outperforms the latest Text2SQL methods
and yields leading performance.
☆ Can LLMs Convert Graphs to Text-Attributed Graphs?
Graphs are ubiquitous data structures found in numerous real-world
applications, such as drug discovery, recommender systems, and social network
analysis. Graph neural networks (GNNs) have become a popular tool to learn node
embeddings through message passing on these structures. However, a significant
challenge arises when applying GNNs to multiple graphs with different feature
spaces, as existing GNN architectures are not designed for cross-graph feature
alignment. To address this, recent approaches introduce text-attributed graphs,
where each node is associated with a textual description, enabling the use of a
shared textual encoder to project nodes from different graphs into a unified
feature space. While promising, this method relies heavily on the availability
of text-attributed data, which can be difficult to obtain in practice. To
bridge this gap, we propose a novel method named Topology-Aware Node
description Synthesis (TANS), which leverages large language models (LLMs) to
automatically convert existing graphs into text-attributed graphs. The key idea
is to integrate topological information with each node's properties, enhancing
the LLMs' ability to explain how graph topology influences node semantics. We
evaluate our TANS on text-rich, text-limited, and text-free graphs,
demonstrating that it enables a single GNN to operate across diverse graphs.
Notably, on text-free graphs, our method significantly outperforms existing
approaches that manually design node features, showcasing the potential of LLMs
for preprocessing graph-structured data, even in the absence of textual
information. The code and data are available at
https://github.com/Zehong-Wang/TANS.
☆ ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers
As large language models (LLMs) grow in size, traditional full fine-tuning
becomes increasingly impractical due to its high computational and storage
costs. Although popular parameter-efficient fine-tuning methods, such as LoRA,
have significantly reduced the number of tunable parameters, there is still
room for further optimization. In this work, we propose ASLoRA, a cross-layer
parameter-sharing strategy combining global sharing with partial adaptive
sharing. Specifically, we share the low-rank matrix A across all layers and
adaptively merge matrix B during training. This sharing mechanism not only
mitigates overfitting effectively but also captures inter-layer dependencies,
significantly enhancing the model's representational capability. We conduct
extensive experiments on various NLP tasks, showing that ASLoRA outperforms
LoRA while using less than 25% of the parameters, highlighting its flexibility
and superior parameter efficiency. Furthermore, in-depth analyses of the
adaptive sharing strategy confirm its significant advantages in enhancing both
model flexibility and task adaptability.
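The parameter savings from sharing the low-rank matrix A can be seen with simple counting (a back-of-the-envelope sketch; the layer count, hidden dimension, rank, and number of merged B groups are invented for illustration):

```python
def lora_params(n_layers: int, d: int, r: int) -> int:
    """Standard LoRA: each layer gets its own A (d x r) and B (r x d)."""
    return n_layers * (d * r + r * d)

def aslora_params(n_layers: int, d: int, r: int, n_b_groups: int) -> int:
    """ASLoRA-style sharing: a single A (d x r) shared across all
    n_layers layers, with the per-layer B matrices adaptively merged
    into n_b_groups groups during training."""
    return d * r + n_b_groups * (r * d)

full = lora_params(n_layers=32, d=4096, r=8)
shared = aslora_params(n_layers=32, d=4096, r=8, n_b_groups=6)
ratio = shared / full  # comfortably below the <25% reported above
```

Even with six distinct B groups, the shared-A budget here is roughly a tenth of standard LoRA's, consistent with the paper's claim of matching LoRA with under 25% of its parameters.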
☆ Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data
Zero-shot named entity recognition (NER) is the task of detecting named
entities of specific types (such as 'Person' or 'Medicine') without any
training examples. Current research increasingly relies on large synthetic
datasets, automatically generated to cover tens of thousands of distinct entity
types, to train zero-shot NER models. However, in this paper, we find that
these synthetic datasets often contain entity types that are semantically
highly similar to (or even the same as) those in standard evaluation
benchmarks. Because of this overlap, we argue that reported F1 scores for
zero-shot NER overestimate the true capabilities of these approaches. Further,
we argue that current evaluation setups provide an incomplete picture of
zero-shot abilities since they do not quantify the label shift (i.e., the
similarity of labels) between training and evaluation datasets. To address
these issues, we propose Familiarity, a novel metric that captures both the
semantic similarity between entity types in training and evaluation, as well as
their frequency in the training data, to provide an estimate of label shift. It
allows researchers to contextualize reported zero-shot NER scores when using
custom synthetic training datasets. Further, it enables researchers to generate
evaluation setups of various transfer difficulties for fine-grained analysis of
zero-shot NER.
comment: 8 pages, 4 figures, 5 tables
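A Familiarity-style score can be sketched as follows (a hypothetical rendering; the similarity weighting and aggregation here are illustrative, and the paper's exact formula likely differs):

```python
def familiarity(eval_labels, train_counts, sim) -> float:
    """For each evaluation label, find its most similar training label
    and weight that similarity by the label's relative frequency in the
    training data; average over all evaluation labels. `sim(a, b)` is
    any similarity in [0, 1], e.g. an embedding cosine."""
    total = sum(train_counts.values())
    scores = []
    for label in eval_labels:
        best = max(train_counts, key=lambda t: sim(label, t))
        scores.append(sim(label, best) * train_counts[best] / total)
    return sum(scores) / len(scores)

# Exact-match similarity stands in for an embedding-based one:
exact = lambda a, b: 1.0 if a == b else 0.0
score = familiarity(["Person", "Spacecraft"],
                    {"Person": 3, "Medicine": 1}, exact)
```

An evaluation label that also appears (frequently) in the synthetic training set pushes the score up, flagging exactly the kind of label overlap that inflates reported zero-shot F1.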
☆ Label-template based Few-Shot Text Classification with Contrastive Learning
As an algorithmic framework for learning to learn, meta-learning provides a
promising solution for few-shot text classification. However, most existing
research fails to give enough attention to class labels. The traditional basic
framework, which builds a meta-learner on prototype networks, relies heavily on
inter-class variance and is easily influenced by noise. To address these
limitations, we propose a simple and effective few-shot text classification
framework. In particular, the corresponding label templates are embedded into
input sentences to fully utilize the potential value of class labels, guiding
the pre-trained model to generate more discriminative text representations
through the semantic information conveyed by labels. With the continuous
influence of label semantics, supervised contrastive learning is utilized to
model the interaction information between support samples and query samples.
Furthermore, the averaging mechanism is replaced with an attention mechanism to
highlight vital semantic information. To verify the proposed scheme, four
typical datasets are employed to assess the performance of different methods.
Experimental results demonstrate that our method achieves substantial
performance enhancements and outperforms existing state-of-the-art models on
few-shot text classification tasks.
☆ MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
Language models (LMs) have excelled in various broad domains. However, to
ensure their safe and effective integration into real-world educational
settings, they must demonstrate proficiency in specific, granular areas of
knowledge. Existing cloze-style benchmarks, commonly used to evaluate LMs'
knowledge, have three major limitations. They: 1) do not cover the educational
domain; 2) typically focus on low-complexity, generic knowledge or broad
domains, which do not adequately assess the models' knowledge in specific
subjects; and 3) often rely on templates that can bias model predictions. Here,
we introduce MALAMUTE, a multilingual, template-free, and highly granular
probing dataset comprising expert-written, peer-reviewed probes from 71
university-level textbooks across three languages (English, Spanish, and
Polish). MALAMUTE is the first education-based cloze-style dataset. It covers
eight domains, each with up to 14 subdomains, further broken down into concepts
and concept-based prompts, totaling 33,361 university curriculum concepts and
116,887 prompts. MALAMUTE's fine granularity, educational focus, and inclusion
of both sentence-level and paragraph-level prompts make it an ideal tool for
evaluating LMs' course-related knowledge. Our evaluation of masked and causal
LMs on MALAMUTE shows that despite overall proficiency, they have significant
gaps in knowledge when examined closely on specific subjects, hindering their
safe use in classrooms and underscoring the need for further development.
☆ RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector AAAI 2025
The real estate market relies heavily on structured data, such as property
details, market trends, and price fluctuations. However, the lack of
specialized Tabular Question Answering datasets in this domain limits the
development of automated question-answering systems. To fill this gap, we
introduce RETQA, the first large-scale open-domain Chinese Tabular Question
Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762
question-answer pairs across 16 sub-fields within three major domains: property
information, real estate company finance information and land auction
information. Compared with existing tabular question answering datasets, RETQA
poses greater challenges due to three key factors: long-table structures,
open-domain retrieval, and multi-domain queries. To tackle these challenges, we
propose the SLUTQA framework, which integrates large language models with
spoken language understanding tasks to enhance retrieval and answering
accuracy. Extensive experiments demonstrate that SLUTQA significantly improves
the performance of large language models on RETQA by in-context learning. RETQA
and SLUTQA provide essential resources for advancing tabular question answering
research in the real estate domain, addressing critical challenges in
open-domain and long-table question-answering. The dataset and code are
publicly available at \url{https://github.com/jensen-w/RETQA}.
comment: This paper is accepted by AAAI 2025
☆ AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation
Detecting sarcasm effectively requires a nuanced understanding of context,
including vocal tones and facial expressions. The progression towards
multimodal computational methods in sarcasm detection, however, faces
challenges due to the scarcity of data. To address this, we present AMuSeD
(Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating
bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm
Detection Dataset (MUStARD) and introduces a two-phase bimodal data
augmentation strategy. The first phase involves generating varied text samples
through Back Translation from several secondary languages. The second phase
involves the refinement of a FastSpeech 2-based speech synthesis system,
tailored specifically for sarcasm to retain sarcastic intonations. Alongside a
cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system
produces corresponding audio for the text augmentations. We also investigate
various attention mechanisms for effectively merging text and audio data,
finding self-attention to be the most efficient for bimodal integration. Our
experiments reveal that this combined augmentation and attention approach
achieves a significant F1-score of 81.0% in text-audio modalities, surpassing
even models that use three modalities from the MUStARD dataset.
comment: This is a preprint version of the paper, submitted and under review
at the IEEE Transactions on Affective Computing
☆ HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation
Jaione Bengoetxea, Mikel Zubillaga, Ekhi Azurmendi, Maite Heredia, Julen Etxaniz, Markel Ferro, Jeremy Barnes
In this paper we present our submission for the NorSID Shared Task as part of
the 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks:
Intent Detection, Slot Filling and Dialect Identification, evaluated using data
in different dialects of the Norwegian language. For Intent Detection and Slot
Filling, we have fine-tuned a multitask model in a cross-lingual setting, to
leverage the xSID dataset available in 17 languages. In the case of Dialect
Identification, our final submission consists of a model fine-tuned on the
provided development set, which has obtained the highest scores within our
experiments. Our final results on the test set show that our models do not drop
in performance compared to the development set, likely due to the
domain-specificity of the dataset and the similar distribution of both subsets.
Finally, we also report an in-depth analysis of the provided datasets and their
artifacts, as well as other sets of experiments that have been carried out but
did not yield the best results. Additionally, we present an analysis on the
reasons why some methods have been more successful than others; mainly the
impact of the combination of languages and domain-specificity of the training
data on the results.
comment: Vardial 2025 NorSID Shared Task
☆ Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA
Previous work finds that recent long-context language models fail to make
equal use of information in the middle of their inputs, preferring pieces of
information located at the tail ends, which creates an undue bias in situations
where we would like models to be equally capable of using different parts of
the input. Thus far, the problem has mainly been considered in settings
with single pieces of critical information, leading us to question what happens
when multiple necessary pieces of information are spread out over the inputs.
Here, we demonstrate the effects of the "lost in the middle" problem in the
multi-hop question answering setting -- in which multiple reasoning "hops" over
disconnected documents are required -- and show that performance degrades not
only with respect to the distance of information from the edges of the context,
but also between pieces of information. Additionally, we experiment with means
of alleviating the problem by reducing superfluous document contents through
knowledge graph triple extraction and summarization, and prompting models to
reason more thoroughly using chain-of-thought prompting.
☆ GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo
Large Language Models (LLMs) are commonly evaluated using human-crafted
benchmarks, under the premise that higher scores implicitly reflect stronger
human-like performance. However, there is growing concern that LLMs may ``game"
these benchmarks due to data leakage, achieving high scores while struggling
with tasks simple for humans. To substantively address the problem, we create
GAOKAO-Eval, a comprehensive benchmark based on China's National College
Entrance Examination (Gaokao), and conduct ``closed-book" evaluations for
representative models released prior to Gaokao. Contrary to prevailing
consensus, even after addressing data leakage and comprehensiveness,
GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned
capabilities. To better understand this mismatch, we introduce the Rasch model
from cognitive psychology to analyze LLM scoring patterns and identify two key
discrepancies: 1) anomalously consistent performance across varying question
difficulties, and 2) high variance in performance on questions of similar
difficulty. In addition, we identify inconsistent grading of LLM-generated
answers among teachers and recurring mistake patterns. We find that these
phenomena are well grounded in the motivations behind OpenAI o1, and that o1's
reasoning-as-difficulties approach can mitigate the mismatch. These results show that
GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current
benchmarks and highlight the need for more LLM-aligned difficulty analysis.
comment: 10 pages, 13 figures
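The Rasch model the authors borrow from cognitive psychology can be sketched in a few lines; the abilities and difficulties below are illustrative toy values, not figures from the paper:

```python
import math

def rasch_p(ability: float, difficulty: float) -> float:
    """Rasch model: probability that a test-taker with the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    """Expected total score over a set of items under the Rasch model."""
    return sum(rasch_p(ability, b) for b in difficulties)

# Under the model, accuracy should fall smoothly as item difficulty rises
# relative to ability; flat performance across difficulties (one anomaly
# the abstract reports) violates this expectation.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [rasch_p(0.0, b) for b in difficulties]
assert all(probs[i] > probs[i + 1] for i in range(len(probs) - 1))
```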
☆ Unsupervised Named Entity Disambiguation for Low Resource Domains EMNLP-2024
In the ever-evolving landscape of natural language processing and information
retrieval, the need for robust and domain-specific entity linking algorithms
has become increasingly apparent. It is crucial in a considerable number of
fields such as humanities, technical writing and biomedical sciences to enrich
texts with semantics and discover more knowledge. The use of Named Entity
Disambiguation (NED) in such domains requires handling noisy texts, low
resource settings and domain-specific KBs. Existing approaches are mostly
inappropriate for such scenarios, as they either depend on training data or are
not flexible enough to work with domain-specific KBs. Thus in this work, we
present an unsupervised approach leveraging the concept of Group Steiner Trees
(GST), which can identify the most relevant candidates for entity
disambiguation using the contextual similarities across candidate entities for
all the mentions present in a document. We outperform the state-of-the-art
unsupervised methods by more than 40\% (in avg.) in terms of Precision@1 across
various domain-specific datasets.
comment: Accepted in EMNLP-2024
☆ Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language COLING 2025
Domain-specific languages that use a lot of specific terminology often fall
into the category of low-resource languages. Collecting test datasets in a
narrow domain is time-consuming and requires skilled human resources with
domain knowledge and training for the annotation task. This study addresses the
challenge of automatically collecting test datasets to evaluate semantic search
in the low-resource, domain-specific German language of the process industry. We
propose an end-to-end annotation pipeline spanning automated query generation
through the score reassessment of query-document pairs. To overcome the
lack of text encoders trained in the German chemistry domain, we explore a
principle of an ensemble of "weak" text encoders trained on common knowledge
datasets. We combine individual relevance scores from diverse models to
retrieve document candidates and relevance scores generated by an LLM, aiming
to achieve consensus on query-document alignment. Evaluation results
demonstrate that the ensemble method significantly improves alignment with
human-assigned relevance scores, outperforming individual models in both
inter-coder agreement and accuracy metrics. These findings suggest that
ensemble learning can effectively adapt semantic search systems for
specialized, low-resource languages, offering a practical solution to resource
limitations in domain-specific contexts.
comment: accepted in the First Workshop on Language Models for Low-Resource
Languages (LoResLM) co-located with the 31st International Conference on
Computational Linguistics (COLING 2025)
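The score-combination idea, averaging normalized relevance scores from several "weak" encoders, can be sketched as follows; the scores and the min-max normalization are illustrative assumptions, not the paper's exact pipeline:

```python
import statistics

def normalize(scores):
    """Min-max normalize one encoder's raw relevance scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_relevance(per_model_scores):
    """Average normalized scores across encoders for each document
    (rows: models, columns: candidate documents)."""
    normed = [normalize(scores) for scores in per_model_scores]
    n_docs = len(per_model_scores[0])
    return [statistics.mean(m[d] for m in normed) for d in range(n_docs)]

# Three hypothetical encoders scoring four candidate documents on
# different raw scales; normalization makes them comparable:
scores = ensemble_relevance([
    [0.2, 0.9, 0.4, 0.1],
    [10.0, 42.0, 30.0, 5.0],
    [0.3, 0.8, 0.7, 0.2],
])
best = max(range(len(scores)), key=scores.__getitem__)
assert best == 1  # all encoders agree document 1 is most relevant
```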
☆ The role of inhibitory control in garden-path sentence processing: A Chinese-English bilingual perspective
In reading garden-path sentences, people must resolve competing
interpretations, though initial misinterpretations can linger despite
reanalysis. This study examines the role of inhibitory control (IC) in managing
these misinterpretations among Chinese-English bilinguals. Using self-paced
reading tasks, we investigated how IC influences recovery from garden-path
sentences in Chinese (L1) and its interaction with language proficiency during
English (L2) processing. Results indicate that IC does not affect garden-path
recovery in Chinese, suggesting reliance on semantic context may reduce the
need for IC. In contrast, findings for English L2 learners reveal a complex
relationship between language proficiency and IC: Participants with low L2
proficiency but high IC showed lingering misinterpretations, while those with
high proficiency exhibited none. These results support and extend the Model of
Cognitive Control (Ness et al., 2023). Moreover, our comparison of three Stroop
task versions identifies the L1 colour-word Stroop task as the preferred measure of
IC in bilingual research.
☆ A Comparative Study of LLMs, NMT Models, and Their Combination in Persian-English Idiom Translation
Large language models (LLMs) have shown superior capabilities in translating
figurative language compared to neural machine translation (NMT) systems.
However, the impact of different prompting methods and LLM-NMT combinations on
idiom translation has yet to be thoroughly investigated. This paper introduces
two parallel datasets of sentences containing idiomatic expressions for
Persian$\rightarrow$English and English$\rightarrow$Persian translations, with
Persian idioms sampled from our PersianIdioms resource, a collection of 2,200
idioms and their meanings. Using these datasets, we evaluate various open- and
closed-source LLMs, NMT models, and their combinations. Translation quality is
assessed through idiom translation accuracy and fluency. We also find that
automatic evaluation methods like LLM-as-a-judge, BLEU and BERTScore are
effective for comparing different aspects of model performance. Our experiments
reveal that Claude-3.5-Sonnet delivers outstanding results in both translation
directions. For English$\rightarrow$Persian, combining weaker LLMs with Google
Translate improves results, while Persian$\rightarrow$English translations
benefit from single prompts for simpler models and complex prompts for advanced
ones.
☆ Small Language Model as Data Prospector for Large Language Model
The quality of instruction data directly affects the performance of
fine-tuned Large Language Models (LLMs). Previously, \cite{li2023one} proposed
\texttt{NUGGETS}, which identifies and selects high-quality data from a
large dataset by identifying those individual instruction examples that can
significantly improve the performance of different tasks after being learnt as
one-shot instances. In this work, we propose \texttt{SuperNUGGETS}, an improved
variant of \texttt{NUGGETS} optimised for efficiency and performance. Our
\texttt{SuperNUGGETS} uses a small language model (SLM) instead of a large
language model (LLM) to filter the data for outstanding one-shot instances and
refines the predefined set of tests. The experimental results show that the
performance of \texttt{SuperNUGGETS} only decreases by 1-2% compared to
\texttt{NUGGETS}, but the efficiency can be increased by a factor of 58.
Compared to the original \texttt{NUGGETS}, our \texttt{SuperNUGGETS} has a
higher utility value due to the significantly lower resource consumption.
☆ Romanized to Native Malayalam Script Transliteration Using an Encoder-Decoder Framework
In this work, we present the development of a reverse transliteration model
to convert romanized Malayalam to native script using an encoder-decoder
framework built with attention-based bidirectional Long Short Term Memory
(Bi-LSTM) architecture. To train the model, we used a curated, combined
collection of 4.3 million transliteration pairs derived from the publicly
available Indic language transliteration datasets Dakshina and Aksharantar. We
evaluated the model on two test datasets provided by the IndoNLP-2025 Shared
Task, containing (1) general typing patterns and (2) ad hoc typing patterns,
respectively. On Test Set-1, we obtained a character error rate (CER) of 7.4%.
However, on Test Set-2, with ad hoc typing patterns where most vowel indicators
are missing, our model gave a CER of 22.7%.
comment: 5 pages
☆ Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework
This paper explores the application of large language models (LLMs) in
nursing and elderly care, focusing on AI-driven patient monitoring and
interaction. We introduce a novel Chinese nursing dataset and implement
incremental pre-training (IPT) and supervised fine-tuning (SFT) techniques to
enhance LLM performance in specialized tasks. Using LangChain, we develop a
dynamic nursing assistant capable of real-time care and personalized
interventions. Experimental results demonstrate significant improvements,
paving the way for AI-driven solutions to meet the growing demands of
healthcare in aging populations.
☆ Simulating Hard Attention Using Soft Attention
We study conditions under which transformers using soft attention can
simulate hard attention, that is, effectively focus all attention on a subset
of positions. First, we examine several variants of linear temporal logic,
whose formulas have previously been shown to be computable using hard
attention transformers. We demonstrate how soft attention transformers can
compute formulas of these logics using unbounded positional embeddings or
temperature scaling. Second, we demonstrate how temperature scaling allows
softmax transformers to simulate a large subclass of average-hard attention
transformers, those that have what we call the uniform-tieless property.
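The temperature-scaling argument can be seen numerically: dividing attention scores by a small temperature drives the softmax toward a hard argmax. A minimal sketch with toy scores (not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 1.0, 1.9, 0.5]

# At temperature 1 the attention mass is spread out; as temperature -> 0
# the softmax converges to hard (argmax) attention on position 0.
soft = softmax(scores, temperature=1.0)
sharp = softmax(scores, temperature=0.01)
assert max(soft) < 0.5
assert sharp[0] > 0.99
```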
☆ Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation
In recent years, text classification methods based on neural networks and
pre-trained models have gained increasing attention and demonstrated excellent
performance. However, these methods still have some limitations in practical
applications: (1) They typically focus only on the matching similarity between
sentences. However, there exists implicit high-value information both within
sentences of the same class and across different classes, which is very crucial
for classification tasks. (2) Existing methods such as pre-trained language
models and graph-based approaches often consume substantial memory for training
and text-graph construction. (3) Although some low-resource methods can achieve
good performance, they often suffer from excessively long processing times. To
address these challenges, we propose a low-resource and fast text
classification model called LFTC. Our approach begins by constructing a
compressor list for each class to fully mine the regularity information within
intra-class data. We then remove redundant information irrelevant to the target
classification to reduce processing time. Finally, we compute the similarity
distance between text pairs for classification. We evaluate LFTC on 9 publicly
available benchmark datasets, and the results demonstrate significant
improvements in performance and processing time, especially under limited
computational and data resources, highlighting its practical advantages.
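The compressor-list idea resembles compression-based classification: a sample belongs to the class whose corpus it compresses best against. A minimal sketch using gzip as a stand-in compressor (the paper's actual compressor construction and distance measure may differ):

```python
import gzip

def compressed_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

def classify(sample: str, class_corpora: dict) -> str:
    """Assign the class whose corpus compresses the sample best: a small
    increase in compressed size means the sample shares regularities
    (repeated substrings) with that class's texts."""
    def extra_bytes(corpus):
        return compressed_size(corpus + " " + sample) - compressed_size(corpus)
    return min(class_corpora, key=lambda c: extra_bytes(class_corpora[c]))

# Toy per-class corpora standing in for the intra-class compressor lists:
corpora = {
    "sports": "the team won the match the player scored a goal " * 20,
    "finance": "the stock price fell the market index rose today " * 20,
}
assert classify("the player scored in the match", corpora) == "sports"
```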
☆ Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning COLING 2025
Large language models (LLMs) have demonstrated remarkable performance across
a wide range of tasks. Advances in prompt engineering and fine-tuning
techniques have further enhanced their ability to address complex reasoning
challenges. However, these advanced capabilities are often exclusive to models
exceeding 100 billion parameters. Although Chain-of-Thought (CoT) fine-tuning
methods have been explored for smaller models (under 10 billion parameters),
they typically depend on extensive CoT training data, which can introduce
inconsistencies and limit effectiveness in low-data settings. To overcome these
limitations, this paper introduces a new reasoning strategy, Solution Guidance
(SG), and a plug-and-play training paradigm, Solution-Guidance Fine-Tuning (SGFT),
for enhancing the reasoning capabilities of small language models. SG focuses
on problem understanding and decomposition at the semantic and logical levels,
rather than specific computations, which can effectively improve the SLMs'
generalization and reasoning abilities. With only a small amount of SG training
data, SGFT can fine-tune an SLM to produce accurate problem-solving guidance,
which can then be flexibly fed to any SLM as prompts, enabling it to generate
correct answers directly. Experimental results demonstrate that our method
significantly improves the performance of SLMs on various reasoning tasks,
enhancing both their practicality and efficiency within resource-constrained
environments.
comment: 11 pages, 4 figures, to be published in The 31st International
Conference on Computational Linguistics (COLING 2025)
☆ Analyzing Fairness of Computer Vision and Natural Language Processing Models
Machine learning (ML) algorithms play a crucial role in decision making
across diverse fields such as healthcare, finance, education, and law
enforcement. Despite their widespread adoption, these systems raise ethical and
social concerns due to potential biases and fairness issues. This study focuses
on evaluating and improving the fairness of Computer Vision and Natural
Language Processing (NLP) models applied to unstructured datasets, emphasizing
how biased predictions can reinforce existing systemic inequalities. A publicly
available dataset from Kaggle was utilized to simulate a practical scenario for
examining fairness in ML workflows. To address and mitigate biases, the study
employed two leading fairness libraries: Fairlearn by Microsoft, and AIF360 by
IBM. These tools offer comprehensive frameworks for fairness analysis,
including metrics evaluation, result visualization, and bias mitigation
techniques. The research aims to measure bias levels in ML models, compare the
effectiveness of these fairness libraries, and provide actionable
recommendations for practitioners. The results demonstrate that each library
possesses distinct strengths and limitations in evaluating and mitigating
fairness. By systematically analyzing these tools, the study contributes
valuable insights to the growing field of ML fairness, offering practical
guidance for integrating fairness solutions into real-world applications. This
research underscores the importance of building more equitable and responsible
machine learning systems.
comment: 16 pages, 1 table, 4 figures
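One of the bias metrics libraries like Fairlearn and AIF360 report, demographic parity difference, is simple to compute by hand. A self-contained sketch on toy data, not tied to either library's API:

```python
def demographic_parity_difference(y_pred, groups):
    """Difference between the highest and lowest positive-prediction
    rates across sensitive groups; 0 means parity."""
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(y_pred, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

# A model that approves 75% of group "a" but only 25% of group "b":
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
assert demographic_parity_difference(y_pred, groups) == 0.5
```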
☆ Benchmarking Table Comprehension In The Wild
Large Language Models (LLMs), while increasingly dominant across a myriad
of knowledge-intensive activities, have had only limited success understanding
lengthy table-text mixtures, such as academic papers and financial reports.
Recent advances of long-context LLMs have opened up new possibilities for this
field. Nonetheless, we identify two roadblocks: (1) Prior benchmarks of table
question answering (TableQA) have focused on isolated tables without context,
making it hard to evaluate models in real-world scenarios. (2) Prior benchmarks
have focused on some narrow skill sets of table comprehension such as table
recognition, data manipulation/calculation, table summarization etc., while a
skilled human employs those skills collectively. In this work, we introduce
TableQuest, a new benchmark designed to evaluate the holistic table
comprehension capabilities of LLMs in the natural table-rich context of
financial reports. We employ a rigorous data processing and filtering procedure
to ensure that the question-answer pairs are logical, reasonable, and diverse.
We experiment with 7 state-of-the-art models, and find that despite reasonable
accuracy in locating facts, they often falter when required to execute more
sophisticated reasoning or multi-step calculations. We conclude with a
qualitative study of the failure modes and discuss the challenges of
constructing a challenging benchmark. We make the evaluation data, judging
procedure and results of this study publicly available to facilitate research
in this field.
comment: Accepted at TRL Workshop@Neurips 2024. Link to data
https://github.com/boson-ai/Table_eval_public
☆ On the Limit of Language Models as Planning Formalizers
Large Language Models have been shown to fail to create executable and
verifiable plans in grounded environments. An emerging line of work shows
success in using an LLM as a formalizer to generate a formal representation (e.g.,
PDDL) of the planning domain, which can be deterministically solved to find a
plan. We systematically evaluate this methodology while bridging some major
gaps. While previous work only generates a partial PDDL representation given
templated and thus unrealistic environment descriptions, we generate the
complete representation given descriptions of various naturalness levels. Among
an array of observations critical to improve LLMs' formal planning ability, we
note that large enough models can effectively formalize descriptions as PDDL,
outperforming those directly generating plans, while being robust to lexical
perturbation. As the descriptions become more natural-sounding, we observe a
decrease in performance and provide detailed error analysis.
☆ Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM
architecture that, for the first time, matches tokenization-based LLM
performance at scale with significant improvements in inference efficiency and
robustness. BLT encodes bytes into dynamically sized patches, which serve as
the primary units of computation. Patches are segmented based on the entropy of
the next byte, allocating more compute and model capacity where increased data
complexity demands it. We present the first FLOP-controlled scaling study of
byte-level models up to 8B parameters and 4T training bytes. Our results
demonstrate the feasibility of scaling models trained on raw bytes without a
fixed vocabulary. Both training and inference efficiency improve due to
dynamically selecting long patches when data is predictable, along with
qualitative improvements on reasoning and long tail generalization. Overall,
for fixed inference costs, BLT shows significantly better scaling than
tokenization-based models, by simultaneously growing both patch and model size.
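The entropy-driven patching can be sketched greedily: start a new patch whenever the predicted next-byte entropy crosses a threshold, so hard-to-predict regions get shorter patches and more compute per byte. The entropies below are toy values; BLT derives them from a small byte-level model, and its segmentation rule may differ in detail:

```python
def segment_patches(byte_seq, entropies, threshold=2.0):
    """Greedy entropy-based patching: cut before any byte whose
    predicted entropy exceeds the threshold."""
    patches, current = [], [byte_seq[0]]
    for b, h in zip(byte_seq[1:], entropies[1:]):
        if h > threshold:
            patches.append(bytes(current))
            current = [b]
        else:
            current.append(b)
    patches.append(bytes(current))
    return patches

data = b"predictable!"
# Toy per-position entropy estimates (high at hard-to-predict positions):
ents = [3.0, 0.5, 0.5, 0.5, 0.5, 2.5, 0.5, 0.5, 0.5, 0.5, 0.5, 3.1]
patches = segment_patches(data, ents)
assert b"".join(patches) == data  # segmentation is lossless
assert len(patches) == 3
```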
☆ Human-Like Embodied AI Interviewer: Employing Android ERICA in Real International Conference COLING 2025
This paper introduces the human-like embodied AI interviewer which integrates
android robots equipped with advanced conversational capabilities, including
attentive listening, conversational repairs, and user fluency adaptation.
Moreover, it can analyze and present results post-interview. We conducted a
real-world case study at SIGDIAL 2024 with 42 participants, of whom 69%
reported positive experiences. This study demonstrated the system's
effectiveness in conducting interviews just like a human and marked the first
employment of such a system at an international conference. The demonstration
video is available at https://youtu.be/jCuw9g99KuE.
comment: This paper has been accepted for demonstration presentation at
International Conference on Computational Linguistics (COLING 2025)
☆ Financial Sentiment Analysis: Leveraging Actual and Synthetic Data for Supervised Fine-tuning
The Efficient Market Hypothesis (EMH) highlights the essence of financial
news in stock price movement. Financial news comes in the form of corporate
announcements, news titles, and other forms of digital text. The generation of
insights from financial news can be done with sentiment analysis.
General-purpose language models are too general for sentiment analysis in
finance. Curated labeled data for fine-tuning general-purpose language models
are scarce, and existing fine-tuned models for sentiment analysis in finance do
not capture the maximum context width. We hypothesize that using actual and
synthetic data can improve performance. We introduce BertNSP-finance to
concatenate shorter financial sentences into longer financial sentences, and
finbert-lc to determine sentiment from digital text. The results show improved
performance on both accuracy and F1 score for the Financial PhraseBank data
with $50\%$ and $100\%$ agreement levels.
☆ Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models AAAI 2025
Fine-tuning pre-trained large language models in a parameter-efficient manner
is widely studied for its effectiveness and efficiency. LoRA is one of the most
widely used methods, which assumes that the optimization process is essentially
low dimensional. Although LoRA has demonstrated commendable performance, there
remains a significant performance gap between LoRA and full fine-tuning when
learning new tasks. In this work, we propose Low-Rank Adaptation with
Task-Relevant Feature Enhancement (LoRATRF) for enhancing task-relevant features
from the perspective of editing neural network representations. To prioritize
task-relevant features, a task-aware filter that selectively extracts valuable
knowledge from hidden representations for the target or current task is
designed. As experiments on a variety of datasets including NLU,
commonsense reasoning, and mathematical reasoning tasks demonstrate, our method
reduces parameters by 33.71% and achieves better performance on a variety of
datasets in comparison with SOTA low-rank methods.
comment: 6 Pages, 3 figures accepted by AAAI 2025 CoLoRAI - Connecting
Low-Rank Representations in AI Workshop
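For context, the low-rank assumption LoRA makes can be written as W + (alpha/r)·BA, with the pretrained W frozen and only the factors A (r×k) and B (d×r) trained. A dependency-free sketch; shapes and initialization here are illustrative:

```python
import random

def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha):
    """Effective weight under LoRA: W + (alpha / r) * B @ A."""
    r = len(A)
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, BA)]

d, k, r = 4, 4, 2
W = [[0.0] * k for _ in range(d)]       # frozen pretrained weight (toy)
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]       # B starts at zero, so the
W_eff = lora_update(W, A, B, alpha=16)  # update is initially a no-op
assert W_eff == W
```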
☆ MERaLiON-AudioLLM: Technical Report
We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning
in One Network), the first speech-text model tailored for Singapore's
multilingual and multicultural landscape. Developed under the National Large
Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates
advanced speech and text processing to address the diverse linguistic nuances
of local accents and dialects, enhancing accessibility and usability in
complex, multilingual environments. Our results demonstrate improvements in
both speech recognition and task-specific understanding, positioning
MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications.
We envision this release to set a precedent for future models designed to
address localised linguistic and cultural contexts in a global framework.
☆ Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
Multimodal large language models have experienced rapid growth, and numerous
different models have emerged. The interpretability of LVLMs remains an
under-explored area. Especially when faced with more complex tasks such as
chain-of-thought reasoning, their internal mechanisms still resemble a black box
that is difficult to decipher. By studying the interaction and information flow
between images and text, we noticed that in models such as LLaVA1.5, image
tokens that are semantically related to text are more likely to have
information flow convergence in the LLM decoding layer, and these image tokens
receive higher attention scores. However, those image tokens that are less
relevant to the text do not have information flow convergence, and they only
get very small attention scores. To efficiently utilize the image information,
we propose a new image token reduction method, Simignore, which aims to improve
the complex reasoning ability of LVLMs by computing the similarity between
image and text embeddings and ignoring image tokens that are irrelevant and
unimportant to the text. Through extensive experiments, we demonstrate the
effectiveness of our method for complex reasoning tasks. The paper's source
code can be accessed from \url{https://github.com/FanshuoZeng/Simignore}.
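The similarity-and-ignore step can be sketched as top-k selection of image tokens by cosine similarity to the text embedding. This is a simplification of Simignore, and the embeddings below are toy values:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def keep_relevant_tokens(image_embs, text_emb, keep_ratio=0.5):
    """Score each image token by cosine similarity to the text embedding
    and keep only the top fraction, dropping tokens the text ignores."""
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(image_embs[i], text_emb),
                    reverse=True)
    k = max(1, int(len(image_embs) * keep_ratio))
    return sorted(ranked[:k])  # indices of retained image tokens

text = [1.0, 0.0]
tokens = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.2]]
assert keep_relevant_tokens(tokens, text, keep_ratio=0.5) == [0, 2]
```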
☆ ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression AAAI2025
Offsite-tuning is a privacy-preserving method for tuning large language
models (LLMs) by sharing a lossy compressed emulator from the LLM owners with
data owners for downstream task tuning. This approach protects the privacy of
both the model and data owners. However, current offsite tuning methods often
suffer from adaptation degradation, high computational costs, and limited
protection strength due to uniformly dropping LLM layers or relying on
expensive knowledge distillation. To address these issues, we propose ScaleOT,
a novel privacy-utility-scalable offsite-tuning framework that effectively
balances privacy and utility. ScaleOT introduces a novel layerwise lossy
compression algorithm that uses reinforcement learning to obtain the importance
of each layer. It employs lightweight networks, termed harmonizers, to replace
the raw LLM layers. By combining important original LLM layers and harmonizers
in different ratios, ScaleOT generates emulators tailored for optimal
performance with various model scales for enhanced privacy protection.
Additionally, we present a rank reduction method to further compress the
original LLM layers, significantly enhancing privacy with negligible impact on
utility. Comprehensive experiments show that ScaleOT can achieve nearly
lossless offsite tuning performance compared with full fine-tuning while
obtaining better model privacy.
comment: accepted by AAAI2025
☆ LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering
Multiple Choice Question Answering (MCQA) is an important problem with
numerous real-world applications, such as medicine, law, and education. The
high cost of building MCQA datasets makes few-shot learning pivotal in this
domain. While Large Language Models (LLMs) can enable few-shot learning, their
direct application in real-world scenarios is often hindered by their high
computational cost. To address this challenge, we propose a simple yet
effective approach that uses LLMs for data generation and scoring. Our approach
utilizes LLMs to create MCQA data which contains questions and choices, and to
assign probability scores to the generated choices. We then use the generated
data and LLM-assigned scores to finetune a smaller and more efficient
encoder-only model, DeBERTa-v3-base, by leveraging a distillation loss. Extensive
experiments on the Massive Multitask Language Understanding (MMLU) benchmark
demonstrate that our method improves accuracy from 28.9% to 39.3%, representing
a gain of over 10% compared to a baseline finetuned directly on 5-shot
examples. This shows the effectiveness of LLM-driven data generation and
knowledge distillation for few-shot MCQA.
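Such distillation setups, training the student on gold labels plus LLM-assigned choice probabilities, are commonly a blend of cross-entropy and KL divergence. A minimal sketch with an assumed 50/50 weighting; the paper's exact loss may differ:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_probs, gold, alpha=0.5):
    """Blend of cross-entropy on the gold choice and KL divergence from
    the LLM-assigned choice probabilities (soft labels)."""
    p = softmax(student_logits)
    ce = -math.log(p[gold])
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_probs, p) if t > 0)
    return alpha * ce + (1 - alpha) * kl

# Four answer choices; the LLM puts most mass on the correct one.
# A student favoring the right choice incurs a lower loss:
teacher = [0.7, 0.1, 0.1, 0.1]
loss_good = distillation_loss([4.0, 0.0, 0.0, 0.0], teacher, gold=0)
loss_bad = distillation_loss([0.0, 4.0, 0.0, 0.0], teacher, gold=0)
assert loss_good < loss_bad
```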
☆ AutoPatent: A Multi-Agent Framework for Automatic Patent Generation
Qiyao Wang, Shiwen Ni, Huaren Liu, Shule Lu, Guhong Chen, Xi Feng, Chi Wei, Qiang Qu, Hamid Alinejad-Rokny, Yuan Lin, Min Yang
As the capabilities of Large Language Models (LLMs) continue to advance, the
field of patent processing has garnered increased attention within the natural
language processing community. However, the majority of research has been
concentrated on classification tasks, such as patent categorization and
examination, or on short text generation tasks like patent summarization and
patent quizzes. In this paper, we introduce a novel and practical task known as
Draft2Patent, along with its corresponding D2P benchmark, which challenges LLMs
to generate full-length patents averaging 17K tokens based on initial drafts.
Patents present a significant challenge to LLMs due to their specialized
nature, standardized terminology, and extensive length. We propose a
multi-agent framework called AutoPatent which leverages the LLM-based planner
agent, writer agents, and examiner agent with PGTree and RRAG to generate
lengthy, intricate, and high-quality complete patent documents. The
experimental results demonstrate that our AutoPatent framework significantly
enhances the ability to generate comprehensive patents across various LLMs.
Furthermore, we have discovered that patents generated solely with the
AutoPatent framework based on the Qwen2.5-7B model outperform those produced by
larger and more powerful LLMs, such as GPT-4o, Qwen2.5-72B, and LLAMA3.1-70B,
in both objective metrics and human evaluations. We will make the data and code
available upon acceptance at \url{https://github.com/QiYao-Wang/AutoPatent}.
comment: 19 pages, 7 figures
☆ Semi-IIN: Semi-supervised Intra-inter modal Interaction Learning Network for Multimodal Sentiment Analysis
Despite multimodal sentiment analysis being a fertile research ground that
merits further investigation, current approaches incur high annotation costs
and suffer from label ambiguity, hindering the acquisition of high-quality
labeled data. Furthermore, choosing the right interactions is essential because
the significance of intra- or inter-modal interactions can differ among various
samples. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal
Interaction learning Network for multimodal sentiment analysis. Semi-IIN
integrates masked attention and gating mechanisms, enabling effective dynamic
selection after independently capturing intra- and inter-modal interactive
information. Combined with the self-training approach, Semi-IIN fully utilizes
the knowledge learned from unlabeled data. Experimental results on two public
datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN,
establishing a new state-of-the-art on several metrics. Code is available at
https://github.com/flow-ljh/Semi-IIN.
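The self-training component mentioned above follows the generic pseudo-labeling recipe, which can be sketched as follows (a minimal sketch with a trivial stand-in model; function names and the confidence threshold are our assumptions, not Semi-IIN's):

```python
# Generic confidence-thresholded self-training: a model trained on labeled
# data pseudo-labels the unlabeled pool, and only high-confidence predictions
# are adopted as new training examples. All names here are illustrative.

def self_train(train_fn, predict_fn, labeled, unlabeled, threshold=0.9, rounds=3):
    """labeled: list of (x, y) pairs; unlabeled: list of x.
    train_fn(labeled) -> model; predict_fn(model, x) -> (label, confidence)."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        model = train_fn(labeled)
        remaining = []
        for x in unlabeled:
            y, conf = predict_fn(model, x)
            if conf >= threshold:
                labeled.append((x, y))  # adopt the confident pseudo-label
            else:
                remaining.append(x)     # leave low-confidence samples unlabeled
        unlabeled = remaining
    return train_fn(labeled), labeled
```

Low-confidence samples simply remain unlabeled, so label ambiguity is never forced into the training set.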
♻ ☆ DroidSpeak: KV Cache Sharing for Efficient Multi-LLM Serving
Yuhan Liu, Yuyang Huang, Jiayi Yao, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse
Large Language Models (LLMs) are increasingly employed in complex workflows,
where different LLMs and fine-tuned variants collaborate on complex tasks.
However, these systems face significant inefficiencies due to redundant
processing of the shared context. We propose DroidSpeak, a framework
that optimizes context sharing between fine-tuned LLMs derived from the same
foundational model. DroidSpeak identifies critical layers in the KV cache and
selectively recomputes them, enabling effective reuse of intermediate data
while maintaining high accuracy.
Our approach balances computational efficiency and task fidelity,
significantly reducing inference latency and throughput bottlenecks.
Experiments on diverse datasets and model pairs demonstrate that DroidSpeak
achieves up to 3x higher throughput and 2.6x faster prefill times with
negligible accuracy loss compared to full recomputation.
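A toy sketch of the layer-selective reuse idea as we read the abstract (all names are hypothetical, and the real system operates on transformer KV tensors rather than strings): rank layers by how much the fine-tuned variant diverges from the base model, recompute only the top-ranked ones, and reuse the cached KV entries everywhere else.

```python
# Hypothetical sketch: choose which KV-cache layers to recompute for a
# fine-tuned variant, and assemble a mixed cache of reused + recomputed layers.

def select_critical_layers(layer_divergence, budget):
    """Pick the `budget` layers with the largest base-vs-variant divergence."""
    ranked = sorted(range(len(layer_divergence)),
                    key=lambda i: layer_divergence[i], reverse=True)
    return set(ranked[:budget])

def build_kv_cache(cached_kv, recompute_kv, critical):
    """Reuse cached per-layer KV entries except for the critical layers."""
    return [recompute_kv(i) if i in critical else cached_kv[i]
            for i in range(len(cached_kv))]
```

The recompute budget controls the accuracy/latency trade-off the abstract describes.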
♻ ☆ Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
The Audio Question Answering (AQA) task includes audio event classification,
audio captioning, and open-ended reasoning. Recently, AQA has garnered
attention due to the advent of Large Audio Language Models (LALMs). Current
literature focuses on constructing LALMs by integrating audio encoders with
text-only Large Language Models (LLMs) through a projection module. While LALMs
excel in general audio understanding, they are limited in temporal reasoning,
which may hinder their commercial applications and on-device deployment. This
paper addresses these challenges and limitations in audio temporal reasoning.
First, we introduce a data augmentation technique for generating reliable audio
temporal questions and answers using an LLM. Second, we further fine-tune an
existing baseline with a curriculum learning strategy to specialize it in
temporal reasoning without compromising performance on previously fine-tuned
tasks. We demonstrate the performance of our model against state-of-the-art
LALMs on public audio benchmark datasets. Third, we implement our AQA model on-device
locally and investigate its CPU inference for edge applications.
comment: 9 pages, 6 figures
♻ ☆ NLP Cluster Analysis of Common Core State Standards and NAEP Item Specifications
Camilli (2024) proposed a methodology using natural language processing (NLP)
to map the relationship of a set of content standards to item specifications.
This study provided evidence that NLP can be used to improve the mapping
process. As part of this investigation, the nominal classifications of
standards and item specifications were used to examine construct equivalence.
In the current paper, we determine the strength of empirical support for the
semantic distinctiveness of these classifications, which are known as "domains"
for Common Core standards, and "strands" for National Assessment of Educational
Progress (NAEP) item specifications. This is accomplished by separate k-means
clustering for standards and specifications of their corresponding embedding
vectors. We then briefly illustrate an application of these findings.
comment: 10 pages, 5 tables
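The clustering step described above maps onto a standard k-means over embedding vectors; a minimal from-scratch sketch (the toy 2-D points stand in for the embedding vectors of standards or specifications, which are not reproduced here):

```python
# Plain k-means: alternate between assigning each vector to its nearest
# center and moving each center to the mean of its assigned vectors.

def kmeans(points, k, iters=20):
    centers = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda j: sum((a - b) ** 2
                                              for a, b in zip(p, centers[j])))
        # update step: move each center to the mean of its members
        for j in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers
```

Running k-means separately on the two embedding sets, as the paper does, then lets one compare recovered clusters against the nominal "domains" and "strands".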
♻ ☆ Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce
Language is a symbolic capital that affects people's lives in many ways
(Bourdieu, 1977, 1991). It is a powerful tool that accounts for identities,
cultures, traditions, and societies in general. Hence, data in a given language
should be viewed as more than a collection of tokens. Good data collection and
labeling practices are key to building more human-centered and socially aware
technologies. While there has been a rising interest in mid- to low-resource
languages within the NLP community, work in this space has to overcome unique
challenges such as data scarcity and access to suitable annotators. In this
paper, we collect feedback from those directly involved in and impacted by NLP
artefacts for mid- to low-resource languages. We conduct a quantitative and
qualitative analysis of the responses and highlight the main issues related to
(1) data quality such as linguistic and cultural data suitability; and (2) the
ethics of common annotation practices such as the misuse of online community
services. Based on these findings, we make several recommendations for the
creation of high-quality language artefacts that reflect the cultural milieu of
their speakers, while simultaneously respecting the dignity and labor of data
workers.
♻ ☆ Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models COLING 2025
We introduce a novel analysis that leverages linguistic minimal pairs to
probe the internal linguistic representations of Large Language Models (LLMs).
By measuring the similarity between LLM activation differences across minimal
pairs, we quantify linguistic similarity and gain insight into the linguistic knowledge captured
by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs
in three languages, reveal properties of linguistic similarity from four key
aspects: consistency across LLMs, relation to theoretical categorizations,
dependence on semantic context, and cross-lingual alignment of relevant
phenomena. Our findings suggest that 1) linguistic similarity is significantly
influenced by training data exposure, leading to higher cross-LLM agreement in
higher-resource languages. 2) Linguistic similarity strongly aligns with
fine-grained theoretical linguistic categories but weakly with broader ones. 3)
Linguistic similarity shows a weak correlation with semantic similarity,
suggesting its context-dependent nature. 4) LLMs exhibit limited cross-lingual
alignment in their understanding of relevant linguistic phenomena. This work
demonstrates the potential of minimal pairs as a window into the neural
representations of language in LLMs, shedding light on the relationship between
LLMs and linguistic theory. Code and data are available at
https://github.com/ChenDelong1999/Linguistic-Similarity
comment: COLING 2025
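A minimal sketch of the measurement, under our reading of the abstract (names and the toy activations are hypothetical): compute an activation-difference vector for each minimal pair, then compare two linguistic phenomena by the cosine similarity of their difference vectors.

```python
def activation_difference(act_good, act_bad):
    # per-dimension difference between a pair's two activation vectors
    return [g - b for g, b in zip(act_good, act_bad)]

def cosine(u, v):
    # cosine similarity between two activation-difference vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)
```

A similarity near 1 would indicate the model encodes the two phenomena along similar directions; near 0, along unrelated ones.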
♻ ☆ Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Preference Optimization
Thomas Savage, Stephen Ma, Abdessalem Boukil, Vishwesh Patel, Ekanath Rangan, Ivan Lopez, Jonathan H Chen
Large Language Model (LLM) fine tuning is underutilized in the field of
medicine. Two of the most common methods of fine tuning are Supervised Fine
Tuning (SFT) and Direct Preference Optimization (DPO), but there is little
guidance informing users when to use either technique. In this investigation,
we compare the performance of SFT and DPO for five common natural language
tasks in medicine: Classification with text data, Classification with numeric
data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT
alone is sufficient for Classification with text data, whereas DPO improves
performance for the more complex tasks of Clinical Reasoning, Summarization and
Clinical Triage. Our results establish the role and importance of DPO fine
tuning within medicine, and consequently call attention to current software
gaps that prevent widespread deployment of this technique.
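Since the abstract contrasts SFT with DPO, it may help to write out the standard DPO objective for a single preference pair (the log-probability arguments below are placeholders, not values from the paper's models):

```python
import math

# DPO loss for one (chosen, rejected) pair:
#   loss = -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
# where logp_* come from the policy being tuned and ref_logp_* from the
# frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss shrinks as the policy increasingly prefers the chosen response over the rejected one.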
♻ ☆ TrustUQA: A Trustful Framework for Unified Structured Data Question Answering AAAI 2025
Wen Zhang, Long Jin, Yushan Zhu, Jiaoyan Chen, Zhiwei Huang, Junjie Wang, Yin Hua, Lei Liang, Huajun Chen
Natural language question answering (QA) over structured data sources such as
tables and knowledge graphs has been widely investigated, especially with
Large Language Models (LLMs) in recent years. The main solutions include
question-to-formal-query parsing and retrieval-based answer generation.
However, current methods of the former kind often suffer from weak
generalization, failing to deal with multiple types of sources, while the
latter are limited in trustfulness. In this paper, we propose TrustUQA, a
trustful QA framework that
can simultaneously support multiple types of structured data in a unified way.
To this end, it adopts an LLM-friendly and unified knowledge representation
method called Condition Graph (CG), and uses an LLM- and demonstration-based
two-level method for CG querying. For enhancement, it is also equipped with
dynamic demonstration retrieval. We have evaluated TrustUQA with 5 benchmarks
covering 3 types of structured data. It outperforms 2 existing unified
structured data QA methods. In comparison with the baselines that are specific
to one data type, it achieves state-of-the-art on 2 of the datasets.
Furthermore, we have demonstrated the potential of our method for more general
QA tasks: QA over mixed structured data and QA across structured data. The code is
available at https://github.com/zjukg/TrustUQA.
comment: Accepted by AAAI 2025
♻ ☆ Bridging Sequence-Structure Alignment in RNA Foundation Models AAAI 2025
The alignment between RNA sequences and structures in foundation models (FMs)
has yet to be thoroughly investigated. Existing FMs have struggled to establish
sequence-structure alignment, hindering the free flow of genomic information
between RNA sequences and structures. In this study, we introduce OmniGenome,
an RNA FM trained to align RNA sequences with respect to secondary structures
based on structure-contextualised modelling. The alignment enables free and
bidirectional mappings between sequences and structures by utilising the
flexible RNA modelling paradigm that supports versatile input and output
modalities, i.e., sequence and/or structure as input/output. We implement RNA
design and zero-shot secondary structure prediction as case studies to evaluate
the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2
benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only
solved up to 3% of the puzzles due to the oversight of sequence-structure
alignment. We leverage four comprehensive in-silico genome modelling benchmarks
to evaluate performance across a diverse set of genome downstream tasks, where
the results show that OmniGenome achieves state-of-the-art performance on RNA
and DNA benchmarks, even without any training on DNA genomes.
comment: Accepted by AAAI 2025
♻ ☆ Citation Amnesia: On The Recency Bias of NLP and Other Academic Fields
This study examines the tendency to cite older work across 20 fields of study
over 43 years (1980--2023). We put NLP's propensity to cite older work in the
context of these 20 other fields to analyze whether NLP shows similar temporal
citation patterns to these other fields over time or whether differences can be
observed. Our analysis, based on a dataset of approximately 240 million papers,
reveals a broader scientific trend: many fields have markedly declined in
citing older works (e.g., psychology, computer science). We term this decline a
'citation age recession', analogous to how economists define periods of reduced
economic activity. The trend is strongest in NLP and ML research (-12.8% and
-5.5% in citation age from previous peaks). Our results suggest that citing
more recent works is not directly driven by the growth in publication rates
(-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) -- even
when controlling for an increase in the volume of papers. Our findings raise
questions about the scientific community's engagement with past literature,
particularly for NLP, and the potential consequences of neglecting older but
relevant research. The data and a demo showcasing our results are publicly
available.
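The statistic behind a "citation age recession" can be made concrete with a minimal sketch (the function name and toy years are ours, not the study's):

```python
# Mean citation age of a paper: the average gap between its publication year
# and the publication years of the works it cites. A field-level decline in
# this average is what the abstract terms a citation age recession.

def mean_citation_age(pub_year, cited_years):
    ages = [pub_year - y for y in cited_years]
    return sum(ages) / len(ages)
```

Aggregating this quantity per field and per year yields the time series whose peaks and declines the study compares across 20 fields.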
♻ ☆ Searching for Structure: Investigating Emergent Communication with Large Language Models
Human languages have evolved to be structured through repeated language
learning and use. These processes introduce biases that operate during language
acquisition and shape linguistic systems toward communicative efficiency. In
this paper, we investigate whether the same happens if artificial languages are
optimised for implicit biases of Large Language Models (LLMs). To this end, we
simulate a classical referential game in which LLMs learn and use artificial
languages. Our results show that initially unstructured holistic languages are
indeed shaped to have some structural properties that allow two LLM agents to
communicate successfully. Similar to observations in human experiments,
generational transmission increases the learnability of languages, but can at
the same time result in non-humanlike degenerate vocabularies. Taken together,
this work extends experimental findings, shows that LLMs can be used as tools
in simulations of language evolution, and opens possibilities for future
human-machine experiments in this field.
♻ ☆ Olympus: A Universal Task Router for Computer Vision Tasks
We introduce Olympus, a new approach that transforms Multimodal Large
Language Models (MLLMs) into a unified framework capable of handling a wide
array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates
over 20 specialized tasks across images, videos, and 3D objects to dedicated
modules. This instruction-based routing enables complex workflows through
chained actions without the need for training heavy generative models. Olympus
easily integrates with existing MLLMs, expanding their capabilities with
comparable performance. Experimental results demonstrate that Olympus achieves
an average routing accuracy of 94.75% across 20 tasks and precision of 91.82%
in chained action scenarios, showcasing its effectiveness as a universal task
router that can solve a diverse range of computer vision tasks. Project page:
http://yuanze-lin.me/Olympus_page/
comment: Technical Report
♻ ☆ Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers
The present paper evaluates the learning behaviour of a transformer-based
neural network with regard to an irregular inflectional paradigm. We apply the
paradigm cell filling problem to irregular patterns. We approach this problem
using the morphological reinflection task and model it as a character
sequence-to-sequence learning problem. The test case under investigation is
irregular verbs in Spanish. Besides its many regular verbs, Spanish has
L-shaped verbs, in which the first person singular indicative stem irregularly
matches the subjunctive paradigm, while the other indicative forms remain
unaltered. We examine
the role of frequency during learning and compare models under differing input
frequency conditions. We train the model on a corpus of Spanish with a
realistic distribution of regular and irregular verbs to compare it with models
trained on input with augmented distributions of (ir)regular words. We explore
how the neural models learn this L-shaped pattern using post-hoc analyses. Our
experiments show that, across frequency conditions, the models are surprisingly
capable of learning the irregular pattern. Furthermore, our post-hoc analyses
reveal the possible sources of errors. All code and data are available at
\url{https://anonymous.4open.science/r/modeling_spanish_acl-7567/} under MIT
license.
comment: Typos and grammatical corrections
♻ ☆ Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Orthographic variation is very common in Luxembourgish texts due to the
absence of a fully-fledged standard variety. Additionally, developing NLP tools
for Luxembourgish is a difficult task given the lack of annotated and parallel
data, which is exacerbated by ongoing standardization. In this paper, we
propose the first sequence-to-sequence normalization models using the ByT5 and
mT5 architectures with training data obtained from word-level real-life
variation data. We perform a fine-grained, linguistically-motivated evaluation
to test byte-based, word-based and pipeline-based models for their strengths
and weaknesses in text normalization. We show that our sequence model using
real-life variation data is an effective approach for tailor-made normalization
in Luxembourgish.
comment: Accepted at VarDial 2025
♻ ☆ Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning COLING 2025
Online abusive content detection, particularly in low-resource settings and
within the audio modality, remains underexplored. We investigate the potential
of pre-trained audio representations for detecting abusive language in
low-resource languages, in this case Indian languages, using Few-Shot
Learning (FSL). Leveraging powerful representations from models such as Wav2Vec
and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset
with FSL. Our approach integrates these representations within the
Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in
10 languages. We experiment with various shot sizes (50-200) evaluating the
impact of limited data on performance. Additionally, a feature visualization
study was conducted to better understand model behaviour. This study highlights
the generalization ability of pre-trained models in low-resource scenarios and
offers valuable insights into detecting abusive language in multilingual
contexts.
comment: Accepted as part of the proceedings of COLING 2025
♻ ☆ A Character-Centric Creative Story Generation via Imagination
Creative story generation has long been a goal of NLP research. While
existing methodologies have aimed to generate long and coherent stories, they
fall significantly short of human capabilities in terms of diversity and
character depth. To address this, we introduce a novel story generation
framework called CCI (Character-centric Creative story generation via
Imagination). CCI features two modules for creative story generation: IG
(Image-Guided Imagination) and MW (Multi-Writer model). In the IG module, we
utilize a text-to-image model to create visual representations of key story
elements, such as characters, backgrounds, and main plots, in a more novel and
concrete manner than text-only approaches. The MW module uses these story
elements to generate multiple persona-description candidates and selects the
best one to insert into the story, thereby enhancing the richness and depth of
the narrative. We compared the stories generated by CCI and baseline models
through statistical analysis, as well as human and LLM evaluations. The results
showed that the IG and MW modules significantly improve various aspects of the
stories' creativity. Furthermore, our framework enables interactive multi-modal
story generation with users, opening up new possibilities for human-LLM
integration in cultural development. Project page : https://www.2024cci.p-e.kr/
♻ ☆ GATEAU: Selecting Influential Sample for Long Context Alignment
Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Aligning large language models to handle instructions with extremely long
contexts has yet to be fully investigated. Previous studies attempt to scale up
the available data volume by synthesizing long instruction-following samples,
as constructing such a dataset tends to be challenging for annotators. However,
a lack of a well-defined strategy for ensuring data quality may introduce
low-quality samples and restrict the model performance. Thus, we propose
GATEAU, a novel framework to address the unique challenge of long context
alignment by identifying the influential samples enriched with long-range
dependency relations. Specifically, GATEAU measures the long-range dependencies
from two essential aspects: the difficulty of generating target responses due
to the long-range dependencies, and the difficulty of understanding long inputs
due to such dependencies. Comprehensive experiments indicate that GATEAU
effectively identifies influential samples and the model trained on these
selected samples exhibits better instruction-following and long-context
understanding capabilities.
♻ ☆ Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias COLING 2025
The rapid growth of Large Language Models (LLMs) has put forward the study of
biases as a crucial field. It is important to assess the influence of different
types of biases embedded in LLMs to ensure fair use in sensitive fields.
Although there has been extensive work on bias assessment in English, such
efforts are scarce for a major language like Bangla. In this work, we
examine two types of social biases in LLM generated outputs for Bangla
language. Our main contributions in this work are: (1) bias studies on two
different social biases for Bangla, (2) a curated dataset for bias measurement
benchmarking and (3) testing two different probing techniques for bias
detection in the context of Bangla. This is the first work of such kind
involving bias assessment of LLMs for Bangla to the best of our knowledge. All
our code and resources are publicly available for the progress of bias related
research in Bangla NLP.
comment: Accepted at The First Workshop on Language Models for Low-Resource
Languages (LoResLM) at COLING 2025
♻ ☆ TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
Recently, numerous new benchmarks have been established to evaluate the
performance of large language models (LLMs) via either computing a holistic
score or employing another LLM as a judge. However, these approaches suffer
from data leakage, owing to open benchmark access, and from an inflexible
evaluation process. To address this issue, we introduce $\textbf{TreeEval}$, a
benchmark-free evaluation method for LLMs that lets a high-performance LLM host
an irreproducible evaluation session, essentially avoiding data leakage.
Moreover, this LLM acts as an examiner, raising a series of questions on a
topic with a tree-planning strategy that considers the current evaluation
status to decide on the next question and ensures the
completeness and efficiency of the evaluation process. We evaluate $6$ models
of different parameter sizes, including $7$B, $13$B, and $33$B, and ultimately
achieved the highest correlation coefficient with AlpacaEval2.0 using only
around $45$ questions. We also conduct more analysis to show the robustness and
reliability of TreeEval. Our code is available at
https://github.com/Ashura5/TreeEval.
♻ ☆ Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs
This paper introduces a novel semi-supervised learning framework specifically
designed for text classification tasks, effectively addressing the challenge of
vast datasets with limited labeled examples. By integrating multi-level,
similarity-based data augmentation techniques, ranging from Retrieval-Augmented
Generation (RAG) to Large Language Model (LLM) rewriting and traditional word
substitution, we construct an intelligent augmentation pipeline. This
framework innovatively employs the selection of representative landmarks
through clustering, which serve as intermediaries in the retrieval and
rewriting processes, ensuring that the augmented data maintains a distribution
similar to the original dataset. Empirical results show that even in complex
text document classification scenarios with over 100 categories, our method
achieves state-of-the-art accuracies of 95.41% and 82.43% on the Reuters and
Web of Science datasets, respectively. These findings highlight the
effectiveness and broad applicability of our semi-supervised learning approach
for text classification tasks.
♻ ☆ Strategic Insights in Human and Large Language Model Tactics at Word Guessing Games
At the beginning of 2022, a simplistic word-guessing game took the world by
storm and was further adapted to many languages beyond the original English
version. In this paper, we examine the strategies of daily word-guessing game
players that have evolved during a period of over two years. A survey gathered
from 25% of frequent players reveals their strategies and motivations for
continuing the daily journey. We also explore the capability of several popular
open-access large language model systems and open-source models at
comprehending and playing the game in two different languages. Results
highlight the struggles of certain models to maintain correct guess length,
their tendency to generate repetitions, and their hallucinations of
non-existent words and inflections.
♻ ☆ Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction
LLMs can generate human-like dialogues, yet their ability to simulate early
child-adult interactions remains largely unexplored. In this paper, we examined
how effectively LLMs can capture the distinctive features of child-caregiver
language in interaction, using both static and interactive benchmarking
methods. We found that state-of-the-art LLMs like Llama 3 and GPT-4o can
approximate child-caregiver dialogues at the word and utterance level, but they
struggle to reproduce the child and caregiver's discursive patterns, exaggerate
alignment, and fail to reach the level of diversity shown by humans. The
broader goal of this work is to initiate the development of a comprehensive
benchmark for LLMs in child-oriented applications.
♻ ☆ Dynamic Fog Computing for Enhanced LLM Execution in Medical Applications
The ability of large language models (LLMs) to transform, interpret, and
comprehend vast quantities of heterogeneous data presents a significant
opportunity to enhance data-driven care delivery. However, the sensitive nature
of protected health information (PHI) raises valid concerns about data privacy
and trust in remote LLM platforms. In addition, the cost associated with
cloud-based artificial intelligence (AI) services continues to impede
widespread adoption. To address these challenges, we propose a shift in the LLM
execution environment from opaque, centralized cloud providers to a
decentralized and dynamic fog computing architecture. By executing open-weight
LLMs in more trusted environments, such as the user's edge device or a fog
layer within a local network, we aim to mitigate the privacy, trust, and
financial challenges associated with cloud-based LLMs. We further present
SpeziLLM, an open-source framework designed to facilitate rapid and seamless
leveraging of different LLM execution layers and lowering barriers to LLM
integration in digital health applications. We demonstrate SpeziLLM's broad
applicability across six digital health applications, showcasing its
versatility in various healthcare settings.
♻ ☆ ViTHSD: Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts
The growth of social networks makes toxic content spread rapidly. Hate speech
detection is a task that helps decrease the number of harmful comments. Given
the diversity of the hate speech created by users, it is necessary to
interpret hate speech in addition to detecting it. Hence, we propose a
methodology to construct
a system for targeted hate speech detection from online streaming texts from
social media. We first introduce the ViTHSD - a targeted hate speech detection
dataset for Vietnamese Social Media Texts. The dataset contains 10K comments;
each comment is labeled with specific targets at three levels: clean,
offensive, and hate. There are 5 targets in the dataset, and each target is
labeled with the corresponding level manually by humans with strict annotation
guidelines. The inter-annotator agreement obtained from the dataset is 0.45 by
Cohen's Kappa index, which indicates a moderate level of agreement. Then, we construct
a baseline for this task by combining the Bi-GRU-LSTM-CNN with the pre-trained
language model to leverage the power of text representation of BERTology.
Finally, we suggest a methodology to integrate the baseline model for targeted
hate speech detection into the online streaming system for practical
application in preventing hateful and offensive content on social media.
comment: Accepted for publication at Journal of Computational Social Science
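The reported agreement figure can be checked in principle with a from-scratch Cohen's kappa (the label sequences below are toy data, not the ViTHSD annotations):

```python
from collections import Counter

# Cohen's kappa: observed agreement between two annotators, corrected for
# the agreement expected by chance given each annotator's label frequencies.

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

A kappa around 0.41-0.60 is conventionally read as moderate agreement, which matches how the dataset's 0.45 is characterized.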
♻ ☆ Evaluation of Code LLMs on Geospatial Code Generation SP
Software development support tools have been studied for a long time, with
recent approaches using Large Language Models (LLMs) for code generation. These
models can generate Python code for data science and machine learning
applications. LLMs are helpful for software engineers because they increase
productivity in daily work. An LLM can also serve as a "mentor" for
inexperienced software developers, and provide viable learning support.
High-quality code generation with LLMs can also be beneficial in geospatial
data science. However, this domain poses different challenges, and code
generation LLMs are typically not evaluated on geospatial tasks. Here, we show
how we constructed an evaluation benchmark for code generation models, based on
a selection of geospatial tasks. We categorised geospatial tasks based on their
complexity and required tools. Then, we created a dataset with tasks that test
model capabilities in spatial reasoning, spatial data processing, and
geospatial tool usage. The dataset consists of specific coding problems that
were manually created to ensure high quality. For every problem, we proposed a set of
test scenarios that make it possible to automatically check the generated code
for correctness. In addition, we tested a selection of existing code generation
LLMs for code generation in the geospatial domain. We share our dataset and
reproducible evaluation code on a public GitHub repository, arguing that this
can serve as an evaluation benchmark for new LLMs in the future. Our dataset
will hopefully contribute to the development of new models capable of solving
geospatial coding tasks with high accuracy. These models will enable the
creation of coding assistants tailored for geospatial applications.
comment: 7th ACM SIGSPATIAL International Workshop on AI for Geographic
Knowledge Discovery (GeoAI'24)
♻ ☆ CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Anton Tikhonov, Nikita Sorokin, Dmitry Abulkhanov, Irina Piontkovskaya, Sergey Nikolenko, Valentin Malykh
We consider the well-known and important tasks of clone detection and
information retrieval for source code. The most standard setup is to search
clones inside the same language code snippets. But it is also useful to find
code snippets with identical behaviour in different programming languages.
Nevertheless, multi- and cross-lingual clone detection has been little studied
in the literature. We present a novel training procedure, cross-consistency
training (CCT) leveraging cross-lingual similarity, that we apply to train
language models on source code in various programming languages. We show that
this training is effective both for encoder- and decoder-based models. The
trained encoder-based CCT-LM model achieves a new state of the art on POJ-104
(monolingual C++ clone detection benchmark) with 96.73\% MAP and AdvTest
(monolingual Python code search benchmark) with 47.18\% MRR. The decoder-based
CCT-LM model shows comparable performance in these tasks. In addition, we
formulate the multi- and cross-lingual clone detection problem and present XCD,
a new benchmark dataset produced from CodeForces submissions.
♻ ☆ Learn and Unlearn in Multilingual LLMs
This paper investigates the propagation of harmful information in
multilingual large language models (LLMs) and evaluates the efficacy of various
unlearning methods. We demonstrate that fake information, regardless of the
language it is in, once introduced into these models through training data, can
spread across different languages, compromising the integrity and reliability
of the generated content. Our findings reveal that standard unlearning
techniques, which typically focus on English data, are insufficient in
mitigating the spread of harmful content in multilingual contexts and could
inadvertently reinforce harmful content across languages. We show that only by
addressing harmful responses in both English and the original language of the
harmful data can we effectively eliminate generations for all languages. This
underscores the critical need for comprehensive unlearning strategies that
consider the multilingual nature of modern LLMs to enhance their safety and
reliability across diverse linguistic landscapes.
♻ ☆ Bootstrapping Heterogeneous Graph Representation Learning via Large Language Models: A Generalized Approach AAAI 2025
Graph representation learning methods are highly effective in handling
complex non-Euclidean data by capturing intricate relationships and features
within graph structures. However, traditional methods face challenges when
dealing with heterogeneous graphs that contain various types of nodes and edges
due to the diverse sources and complex nature of the data. Existing
Heterogeneous Graph Neural Networks (HGNNs) have shown promising results but
require prior knowledge of node and edge types and unified node feature
formats, which limits their applicability. Recent advancements in graph
representation learning using Large Language Models (LLMs) offer new solutions
by integrating LLMs' data processing capabilities, enabling the alignment of
various graph representations. Nevertheless, these methods often overlook
heterogeneous graph data and require extensive preprocessing. To address these
limitations, we propose a novel method that leverages the strengths of both LLM
and GNN, allowing for the processing of graph data with any format and type of
nodes and edges without the need for type information or special preprocessing.
Our method employs LLM to automatically summarize and classify different data
formats and types, aligns node features, and uses a specialized GNN for
targeted learning, thus obtaining effective graph representations for
downstream tasks. Theoretical analysis and experimental validation have
demonstrated the effectiveness of our method.
comment: Accepted by AAAI 2025
♻ ☆ First Train to Generate, then Generate to Train: UnitedSynT5 for Few-Shot NLI
Natural Language Inference (NLI) tasks require identifying the relationship
between sentence pairs, typically classified as entailment, contradiction, or
neutrality. While the current state-of-the-art (SOTA) model, Entailment
Few-Shot Learning (EFL), achieves a 93.1% accuracy on the Stanford Natural
Language Inference (SNLI) dataset, further advancements are constrained by the
dataset's limitations. To address this, we propose a novel approach leveraging
synthetic data augmentation to enhance dataset diversity and complexity. We
present UnitedSynT5, an advanced extension of EFL that leverages a T5-based
generator to synthesize additional premise-hypothesis pairs, which are
rigorously cleaned and integrated into the training data. These augmented
examples are processed within the EFL framework, embedding labels directly into
hypotheses for consistency. We train a GTR-T5-XL model on this expanded
dataset, achieving a new benchmark of 94.7% accuracy on the SNLI dataset, 94.0%
accuracy on the E-SNLI dataset, and 92.6% accuracy on the MultiNLI dataset,
surpassing the previous SOTA models. This research demonstrates the potential
of synthetic data augmentation in improving NLI models, offering a path forward
for further advancements in natural language understanding tasks.
comment: 14 pages
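The EFL framework mentioned above embeds candidate labels directly into hypotheses, turning 3-way NLI into binary entailment decisions. A minimal sketch of that reformulation, with an illustrative template (the paper's exact wording may differ):

```python
# Sketch: converting one labeled NLI pair into three binary EFL-style
# examples by embedding each candidate label into the hypothesis.
# The template string is an assumption for illustration.

LABELS = ["entailment", "contradiction", "neutral"]

def embed_label(premise: str, hypothesis: str, gold: str):
    """Turn one 3-way NLI example into three binary examples."""
    examples = []
    for label in LABELS:
        text = f"{premise} [SEP] {hypothesis} This implies {label}."
        examples.append({"text": text, "target": 1 if label == gold else 0})
    return examples

pairs = embed_label(
    "A man is playing a guitar.",
    "A person is making music.",
    "entailment",
)
```

Each original example thus yields one positive and two negative binary instances, which is what lets a single entailment classifier cover all three labels.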
♻ ☆ Role-playing Prompt Framework: Generation and Evaluation
Large language models (LLMs) exhibit impressive proficiency in natural
language generation, understanding user instructions, and emulating human-like
language use, which has led to significant interest in their application to
role-playing scenarios. However, the manual collection of role-specific script
data and the evaluation of model performance are resource-intensive processes.
This paper introduces a prompt-based framework designed to leverage GPT's
capabilities for the generation of role-playing dialogue datasets and the
evaluation of role-playing performance. To validate the effectiveness of the
GPT-based generation and evaluation, we further incorporate the recall-oriented
ROUGE-L metric, providing an additional quantitative measure of performance.
♻ ☆ Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings
In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is
the task of answering a question based on a given context. To handle questions
in the medical domain, modern language models such as BioBERT, SciBERT and even
ChatGPT are trained on vast amounts of in-domain medical corpora. However,
in-domain pre-training is expensive in terms of time and resources. In this
paper, we propose a resource-efficient approach for injecting domain knowledge
into a model without relying on such domain-specific pre-training.
Knowledge graphs are powerful resources for accessing medical information.
Building on existing work, we introduce a method using Multi-Layer Perceptrons
(MLPs) for aligning and integrating embeddings extracted from medical knowledge
graphs with the embedding spaces of pre-trained language models (LMs). The
aligned embeddings are fused with open-domain LMs BERT and RoBERTa that are
fine-tuned for two MRC tasks, span detection (COVID-QA) and multiple-choice
questions (PubMedQA). We compare our method to prior techniques that rely on a
vocabulary overlap for embedding alignment and show how our method circumvents
this requirement to deliver better performance. On both datasets, our method
allows BERT/RoBERTa to either perform on par with (occasionally exceeding)
stronger domain-specific models or show improvements in general over prior
techniques. With the proposed approach, we signal an alternative method to
in-domain pre-training to achieve domain proficiency. Our code is available
here.
comment: Accepted to the MABM workshop at IEEE BIBM 2024
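The core mechanism here is an MLP that maps knowledge-graph entity embeddings into the LM's embedding space so the two can be fused. A minimal sketch under stated assumptions: toy dimensions, random weights standing in for trained parameters, and fusion by addition (the paper's fusion may differ):

```python
# Sketch: an MLP aligns KG embeddings (dim 3) to the LM space (dim 4);
# the aligned vector is fused with a token embedding by addition.
# Dimensions, weights, and the fusion rule are illustrative assumptions.
import random

random.seed(0)
KG_DIM, LM_DIM, HIDDEN = 3, 4, 5

# Random weights stand in for parameters learned during alignment training.
W1 = [[random.gauss(0, 0.1) for _ in range(KG_DIM)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(LM_DIM)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def align(kg_vec):
    hidden = [max(0.0, h) for h in matvec(W1, kg_vec)]  # ReLU
    return matvec(W2, hidden)                            # project to LM space

def fuse(token_emb, kg_vec):
    return [t + a for t, a in zip(token_emb, align(kg_vec))]

fused = fuse([0.5, -0.2, 0.1, 0.0], [1.0, 0.0, -1.0])
```

Because the MLP learns the mapping end to end, no vocabulary overlap between the KG and the LM is needed, which is the requirement the paper says it circumvents.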
♻ ☆ Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Difference visual question answering (diff-VQA) is a challenging task that
requires answering complex questions based on differences between a pair of
images. This task is particularly important in reading chest X-ray images
because radiologists often compare multiple images of the same patient taken at
different times to track disease progression and changes in its severity in
their clinical practice. However, previous works focused on designing specific
network architectures for the diff-VQA task, missing opportunities to enhance
the model's performance using a pretrained vision-language model (VLM). Here,
we introduce a novel VLM called PLURAL, which is pretrained on natural and
longitudinal chest X-ray data for the diff-VQA task. The model is developed
using a step-by-step approach: it is first pretrained on natural images and
texts, then trained on longitudinal chest X-ray data. The
longitudinal data consist of pairs of X-ray images, along with question-answer
sets and radiologists' reports that describe the changes in lung abnormalities
and diseases over time. Our experimental results show that the PLURAL model
outperforms state-of-the-art methods not only in diff-VQA for longitudinal
X-rays but also in conventional VQA for a single X-ray image. Through extensive
experiments, we demonstrate the effectiveness of the proposed VLM architecture
and pretraining method in improving the model's performance.
♻ ☆ ReFT: Reasoning with Reinforced Fine-Tuning ACL 2024
One way to enhance the reasoning capability of Large Language Models (LLMs)
is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT)
annotations. This approach does not show sufficiently strong generalization
ability, however, because the training only relies on the given CoT data. In
math problem-solving, for example, there is usually only one annotated
reasoning path for each question in the training data. Intuitively, it would be
better for the algorithm to learn from multiple annotated reasoning paths given
a question. To address this issue, we propose a simple yet effective approach
called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of
learning LLMs for reasoning, with math problem-solving as an example. ReFT
first warms up the model with SFT, and then employs online reinforcement
learning, specifically the PPO algorithm in this paper, to further fine-tune
the model, where an abundance of reasoning paths are automatically sampled
given the question and the rewards are naturally derived from the ground-truth
answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that
ReFT significantly outperforms SFT, and the performance can be potentially
further boosted by combining inference-time strategies such as majority voting
and re-ranking. Note that ReFT obtains the improvement by learning from the
same training questions as SFT, without relying on extra or augmented training
questions. This indicates a superior generalization ability for ReFT.
comment: ACL 2024 main conference; adjust with reviewer comments; 13 pages
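The key to ReFT's annotation-free reward is that sampled reasoning paths are scored only by whether their final answer matches the ground truth. A minimal sketch, assuming an illustrative "The answer is ..." path format and answer extractor (not the paper's exact implementation):

```python
# Sketch: ReFT-style reward derived purely from the ground-truth answer.
# The CoT format and the extractor below are illustrative assumptions.

def extract_answer(path: str) -> str:
    """Assume each sampled CoT ends with 'The answer is <x>.'"""
    marker = "The answer is "
    return path.split(marker)[-1].rstrip(".").strip() if marker in path else ""

def reward(path: str, gold: str) -> float:
    return 1.0 if extract_answer(path) == gold else 0.0

sampled_paths = [
    "3 + 4 = 7, then 7 * 2 = 14. The answer is 14.",
    "3 * 4 = 12, then 12 + 2 = 14. The answer is 14.",
    "3 + 4 = 7, then 7 + 2 = 9. The answer is 9.",
]
rewards = [reward(p, "14") for p in sampled_paths]
```

These per-path rewards are what PPO optimizes during the online stage, so multiple distinct correct paths (like the first two above) can all be reinforced from a single annotated answer.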
♻ ☆ Political Actor Agent: Simulating Legislative System for Roll Call Votes Prediction with Large Language Models AAAI 2025
Predicting roll call votes through modeling political actors has emerged as a
focus in quantitative political science and computer science. Widely used
embedding-based methods generate vectors for legislators from diverse data sets
to predict legislative behaviors. However, these methods often contend with
challenges such as the need for manually predefined features, reliance on
extensive training data, and a lack of interpretability. Achieving more
interpretable predictions under flexible conditions remains an unresolved
issue. This paper introduces the Political Actor Agent (PAA), a novel
agent-based framework that utilizes Large Language Models to overcome these
limitations. By employing role-playing architectures and simulating the
legislative system, PAA provides a scalable and interpretable paradigm for predicting
roll-call votes. Our approach not only enhances the accuracy of predictions but
also offers multi-view, human-understandable decision reasoning, providing new
insights into political actor behaviors. We conducted comprehensive experiments
using voting records from the 117th-118th U.S. House of Representatives,
validating the superior performance and interpretability of PAA. This study not
only demonstrates PAA's effectiveness but also its potential in political
science research.
comment: Accepted at AAAI 2025
♻ ☆ Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators AAAI 2025
Despite their remarkable capabilities, Large Language Models (LLMs) are prone
to generate responses that contradict verifiable facts, i.e., unfaithful
hallucination content. Existing efforts generally focus on optimizing model
parameters or editing semantic representations, which compromise the internal
factual knowledge of target LLMs. In addition, hallucinations typically exhibit
multifaceted patterns in downstream tasks, limiting the model's holistic
performance across tasks. In this paper, we propose a Comparator-driven
Decoding-Time (CDT) framework to alleviate response hallucination. First,
we construct hallucinatory and truthful comparators with multi-task fine-tuning
samples. In this case, we present an instruction prototype-guided mixture of
experts strategy to enhance the ability of the corresponding comparators to
capture different hallucination or truthfulness patterns in distinct task
instructions. CDT constrains next-token predictions to factuality-robust
distributions by contrasting the logit differences between the target LLMs and
these comparators. Systematic experiments on multiple downstream tasks show
that our framework can significantly improve the model performance and response
factuality.
comment: Accepted by AAAI 2025
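The decoding-time contrast described above can be pictured as shifting the target model's next-token logits toward the truthful comparator and away from the hallucinatory one. A minimal sketch with a toy 3-token vocabulary; the combination rule and weight `ALPHA` are assumptions, and CDT's actual comparators are mixture-of-experts models per task:

```python
# Sketch: comparator-driven logit contrast at decoding time.
# The linear combination and ALPHA are illustrative assumptions.

ALPHA = 1.0

def contrast(target, truthful, hallucinatory, alpha=ALPHA):
    return [t + alpha * (tr - ha)
            for t, tr, ha in zip(target, truthful, hallucinatory)]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy vocabulary of 3 tokens; token 1 is the hallucinatory favorite.
target_logits        = [1.0, 1.2, 0.1]
truthful_logits      = [1.5, 0.2, 0.1]
hallucinatory_logits = [0.2, 1.8, 0.1]

adjusted = contrast(target_logits, truthful_logits, hallucinatory_logits)
```

Here greedy decoding on the raw target logits would pick token 1, while the contrasted distribution flips the choice to token 0, which the truthful comparator favors.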
♻ ☆ AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
We investigate the reasoning capabilities of large language models (LLMs) for
automatically generating data-cleaning workflows. To evaluate LLMs' ability to
complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data
Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations
to repair three types of data quality issues: duplicates, missing values, and
inconsistent data formats. Given a dirty table and a purpose (expressed as a
query), this pipeline generates a minimal, clean table sufficient to address
the purpose and the data cleaning workflow used to produce the table. The
planning process involves three main LLM-driven components: (1) Select Target
Columns: Identifies a set of target columns related to the purpose. (2) Inspect
Column Quality: Assesses the data quality for each target column and generates
a Data Quality Report as operation objectives. (3) Generate Operation &
Arguments: Predicts the next operation and arguments based on the data quality
report results. Additionally, we propose a data cleaning benchmark to evaluate
the capability of LLM agents to automatically generate workflows that address
data cleaning purposes of varying difficulty levels. The benchmark comprises
the annotated datasets as a collection of purpose, raw table, clean table, data
cleaning workflow, and answer set. In our experiments, we evaluated three LLMs
that auto-generate purpose-driven data cleaning workflows. The results indicate
that LLMs perform well in planning and generating data-cleaning workflows
without the need for fine-tuning.
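The three LLM-driven components form a simple planning loop. A minimal sketch in which each LLM call is replaced by a deterministic stub so the control flow is visible; all prompts, heuristics, and operation names here are illustrative assumptions:

```python
# Sketch of the AutoDCWorkflow planning loop with stubbed LLM components.
# The stub logic and operation names are illustrative assumptions.

def select_target_columns(table, purpose):
    # Component 1: select columns relevant to the purpose (LLM call, stubbed).
    return [c for c in table["columns"] if c in purpose]

def inspect_column_quality(table, column):
    # Component 2: produce a data-quality report per column (LLM call, stubbed).
    values = table["rows_by_column"][column]
    issues = []
    if len(set(values)) < len(values):
        issues.append("duplicates")
    if any(v is None for v in values):
        issues.append("missing values")
    return {"column": column, "issues": issues}

def generate_operation(report):
    # Component 3: predict the next operation and arguments (LLM call, stubbed).
    if "missing values" in report["issues"]:
        return ("fill_missing", report["column"])
    if "duplicates" in report["issues"]:
        return ("deduplicate", report["column"])
    return None

table = {
    "columns": ["city", "year", "notes"],
    "rows_by_column": {
        "city": ["Oslo", "Oslo", None],
        "year": [2020, 2021, 2022],
        "notes": ["a", "b", "c"],
    },
}

workflow = []
for col in select_target_columns(table, "average by city"):
    op = generate_operation(inspect_column_quality(table, col))
    if op:
        workflow.append(op)
```

In the real pipeline each stub would be an LLM prompt, and the resulting operation sequence is the data-cleaning workflow that accompanies the minimal clean table.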
♻ ☆ Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz, Satak Kumar Dey, Ruwad Naswan, Hasnaen Adil, Khondker Salman Sayeed, Haz Sameen Shahgir
Each new generation of English-oriented Large Language Models (LLMs) exhibits
enhanced cross-lingual transfer capabilities and significantly outperforms
older LLMs on low-resource languages. This prompts the question: Is there a
need for LLMs dedicated to a particular low-resource language? We aim to
explore this question for Bengali, a low-to-moderate resource Indo-Aryan
language native to the Bengal region of South Asia.
We compare the performance of open-weight and closed-source LLMs such as
LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse
set of Bengali downstream tasks, including translation, summarization,
paraphrasing, question-answering, and natural language inference. Our findings
reveal that while LLMs generally excel in reasoning tasks, their performance in
tasks requiring Bengali script generation is inconsistent. Key challenges
include inefficient tokenization of Bengali script by existing LLMs, leading to
increased computational costs and potential performance degradation.
Additionally, we highlight biases in machine-translated datasets commonly used
for Bengali NLP tasks. We conclude that there is a significant need for a
Bengali-oriented LLM, but the field currently lacks the high-quality
pretraining and instruction-tuning datasets necessary to develop a highly
effective model.
♻ ☆ Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts
Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and
various spoken language phenomena such as disfluencies, ungrammatical
sentences, and incomplete sentences, hence suffering from poor readability. To
improve readability, we propose a Contextualized Spoken-to-Written conversion
(CoS2W) task to address ASR and grammar errors and also transfer the informal
text into the formal style with content preserved, utilizing contexts and
auxiliary information. This task naturally matches the in-context learning
capabilities of Large Language Models (LLMs). To facilitate comprehensive
comparisons of various LLMs, we construct a document-level Spoken-to-Written
conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study
the impact of different granularity levels on the CoS2W performance, and
propose methods to exploit contexts and auxiliary information to enhance the
outputs. Experimental results reveal that LLMs have the potential to excel in
the CoS2W task, particularly in grammaticality and formality, and that our
methods help LLMs make effective use of contexts and auxiliary information.
We further investigate the effectiveness of using LLMs as evaluators and find
that LLM evaluators show strong correlations with human evaluations on rankings
of faithfulness and formality, which validates the reliability of LLM
evaluators for the CoS2W task.
comment: 7 pages, 3 figures
♻ ☆ Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models
Retrieval-augmented large language models (LLMs) have been remarkably
competent in various NLP tasks. However, it was observed by previous works that
retrieval is not always helpful, especially when the LLM is already
knowledgeable on the query to answer. Motivated by this, Adaptive
Retrieval-Augmented Generation (ARAG) studies retrieving only when the
knowledge asked by the query is absent from the LLM. Previous ARAG methods
either require access to the pre-training corpus or incur additional model
inferences via prompting. To avoid such drawbacks, we propose to determine
whether the model is knowledgeable on a query via inspecting the
(contextualized) pre-trained token embeddings of LLMs. We hypothesize that such
embeddings capture rich information on the model's intrinsic knowledge base,
which enables an efficient way of judging the necessity to retrieve from an
external corpus. Extensive experiments demonstrate our ARAG approach's superior
performance across various benchmarks.
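The adaptive-retrieval decision can be sketched as a lightweight probe over the query's token embeddings: retrieve only when the probe's "knowledgeability" score falls below a threshold. The mean-pooled dot-product probe, its vector, and the threshold below are all illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: embedding-informed decision on whether to retrieve.
# The probe (mean-pool then dot product) and threshold are assumptions.

THRESHOLD = 0.5
PROBE = [0.6, -0.2, 0.8]  # stands in for a trained probe vector

def knowledgeability(token_embeddings):
    dim = len(token_embeddings[0])
    pooled = [sum(e[i] for e in token_embeddings) / len(token_embeddings)
              for i in range(dim)]
    return sum(p * q for p, q in zip(pooled, PROBE))

def should_retrieve(token_embeddings, threshold=THRESHOLD):
    return knowledgeability(token_embeddings) < threshold

known_query   = [[1.0, 0.0, 1.0], [0.8, 0.1, 0.9]]  # high probe score
unknown_query = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]]  # low probe score
```

Because the probe reads embeddings the LLM computes anyway, the decision adds no extra generation passes, which is the efficiency argument the abstract makes.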
♻ ☆ SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ICML 2024
Large language models (LLMs) have proven to be highly effective across
various natural language processing tasks. However, their large number of
parameters poses significant challenges for practical deployment. Pruning, a
technique aimed at reducing the size and complexity of LLMs, offers a potential
solution by removing redundant components from the network. Despite the promise
of pruning, existing methods often struggle to achieve substantial end-to-end
LLM inference speedup. In this paper, we introduce SLEB, a novel approach
designed to streamline LLMs by eliminating redundant transformer blocks. We
choose the transformer block as the fundamental unit for pruning, because LLMs
exhibit block-level redundancy with high similarity between the outputs of
neighboring blocks. This choice allows us to effectively enhance the processing
speed of LLMs. Our experimental results demonstrate that SLEB outperforms
previous LLM pruning methods in accelerating LLM inference while also
maintaining superior perplexity and accuracy, making SLEB a promising
technique for enhancing the efficiency of LLMs. The code is available at:
https://github.com/jiwonsong-dev/SLEB.
comment: ICML 2024
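The redundancy signal behind SLEB is that a transformer block whose output is nearly identical to its input contributes little and is a pruning candidate. A minimal sketch using cosine similarity between a block's input and output hidden states; the toy vectors and threshold are stand-ins, and the actual method scores blocks on calibration data:

```python
# Sketch: flag transformer blocks whose output barely differs from their
# input (high cosine similarity) as redundant. Threshold is an assumption.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def redundant_blocks(block_io_pairs, threshold=0.99):
    """block_io_pairs[i] = (input_hidden, output_hidden) of block i."""
    return [i for i, (inp, out) in enumerate(block_io_pairs)
            if cosine(inp, out) >= threshold]

pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),   # block 0: transforms a lot
    ([1.0, 2.0, 3.0], [1.01, 2.0, 3.0]),  # block 1: near-identity
]
to_prune = redundant_blocks(pairs)
```

Removing whole blocks (rather than individual weights) is what yields end-to-end speedup, since each pruned block deletes an entire sequential stage of computation.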
♻ ☆ Improvement in Sign Language Translation Using Text CTC Alignment
Current sign language translation (SLT) approaches often rely on gloss-based
supervision with Connectionist Temporal Classification (CTC), limiting their
ability to handle non-monotonic alignments between sign language video and
spoken text. In this work, we propose a novel method combining joint
CTC/Attention and transfer learning. The joint CTC/Attention introduces
hierarchical encoding and integrates CTC with the attention mechanism during
decoding, effectively managing both monotonic and non-monotonic alignments.
Meanwhile, transfer learning helps bridge the modality gap between vision and
language in SLT. Experimental results on two widely adopted benchmarks,
RWTH-PHOENIX-Weather 2014T and CSL-Daily, show that our method achieves
results comparable to the state of the art and outperforms the pure-attention
baseline. Additionally, this work opens a new door for future research into
gloss-free SLT using text-based CTC alignment.
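The joint CTC/Attention objective is typically a convex combination of the two losses: CTC handles monotonic alignment while the attention decoder handles non-monotonic reordering. A one-line sketch with an illustrative weight and toy loss values (the paper's hierarchical encoding is not shown):

```python
# Sketch: joint CTC/Attention training objective.
# lam and the loss values below are illustrative assumptions.

def joint_loss(ctc_loss, attention_loss, lam=0.3):
    return lam * ctc_loss + (1.0 - lam) * attention_loss

loss = joint_loss(ctc_loss=2.0, attention_loss=1.0)
```

At decoding time the same two scores can be combined to rescore hypotheses, which is how the joint model manages both alignment regimes.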
♻ ☆ LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han
Long-context capability is critical for multi-modal foundation models,
especially for long video understanding. We introduce LongVILA, a full-stack
solution for long-context visual-language models by co-designing the algorithm
and system. For model training, we upgrade existing VLMs to support long video
understanding by incorporating two additional stages, i.e., long context
extension and long video supervised fine-tuning. However, training on long
video is computationally and memory intensive. We introduce the long-context
Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes
long video training and inference, enabling 2M context length training on 256
GPUs without any gradient checkpointing. LongVILA efficiently extends the
number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy on the
6,000-frame (more than 1 million tokens) video needle-in-a-haystack task.
LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g.
65.1% on VideoMME with subtitles. In addition, MM-SP is 2.1x-5.7x faster than
ring-style sequence parallelism and 1.1x-1.4x faster than Megatron with a hybrid
context and tensor parallelism. Moreover, it seamlessly integrates with Hugging
Face Transformers.
comment: Code and models are available at
https://github.com/NVlabs/VILA/tree/main/longvila